Cloudflare Accuses Perplexity of Covert AI Scraping Tactics

Knowtechie

In a rapidly escalating dispute, internet infrastructure giant Cloudflare has publicly accused AI search engine Perplexity of employing “stealth crawling” tactics to bypass website restrictions and scrape content. The allegations, detailed in a research post published by Cloudflare on Monday, August 4, 2025, have ignited a fresh debate over the ethics of AI data collection and the control content creators have over their digital assets.

Cloudflare’s claims stem from an investigation initiated after numerous customers reported that Perplexity’s AI bots were still accessing their websites despite explicit blocks via robots.txt files and other network-level rules. According to Cloudflare, Perplexity’s crawlers initially identified themselves with declared user agents such as “PerplexityBot” but, when faced with a network block, obscured their identity in an attempt to circumvent website preferences.
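To illustrate why the user agent matters here, the following is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the example URL, and the helper name `is_allowed` are hypothetical and are not taken from Cloudflare's report. The sketch shows only that robots.txt rules are matched against whatever user agent a client declares, and that compliance is voluntary on the client's side.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

# A generic desktop-browser user agent string (illustrative only).
CHROME_MAC_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    print("PerplexityBot:", is_allowed(ROBOTS_TXT, "PerplexityBot", url))
    print("Browser UA:   ", is_allowed(ROBOTS_TXT, CHROME_MAC_UA, url))
```

A rule that names a specific crawler does not match a client presenting a browser user agent, which is why site operators who want enforcement rather than a request typically add network-level rules as well.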

The alleged tactics include impersonating legitimate browsers, such as Google Chrome on macOS, and rotating IP addresses and Autonomous System Numbers (ASNs) to evade detection. Cloudflare’s researchers said they observed this activity across “tens of thousands of domains and millions of requests per day,” originating from outside Perplexity’s officially declared IP ranges. To substantiate its findings, Cloudflare created test domains configured to deny bot access, which Perplexity’s crawlers reportedly still managed to access and retrieve information from. Cloudflare CEO Matthew Prince went so far as to liken Perplexity’s alleged actions to those of “North Korean hackers.” In response to its findings, Cloudflare has removed Perplexity from its list of verified bots and implemented new managed rule heuristics to detect and block such stealth crawling across its network.
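The idea of traffic “operating outside officially declared IP ranges” can be sketched as a simple check of a request's source address against a published list. This is an illustrative assumption, not Cloudflare's detection method. The ranges below are reserved documentation networks standing in for a real published list, and `classify_request` is a hypothetical helper.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation networks), NOT real crawler ranges.
DECLARED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]
DECLARED_UA_TOKEN = "PerplexityBot"


def in_declared_ranges(ip: str) -> bool:
    """Return True if ip falls inside any declared network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in DECLARED_RANGES)


def classify_request(user_agent: str, ip: str) -> str:
    """Label a request by comparing its claimed identity with its source IP."""
    claims_bot = DECLARED_UA_TOKEN.lower() in user_agent.lower()
    from_declared = in_declared_ranges(ip)
    if claims_bot and from_declared:
        return "declared crawler"
    if claims_bot:
        return "unverified: declared user agent from undeclared IP"
    if from_declared:
        return "declared IP without declared user agent"
    # A browser-like user agent from an arbitrary IP looks like any visitor;
    # telling a stealth crawler apart here would need further signals.
    return "undeclared"


if __name__ == "__main__":
    print(classify_request("PerplexityBot/1.0", "192.0.2.10"))
    print(classify_request("PerplexityBot/1.0", "203.0.113.5"))
    print(classify_request("Mozilla/5.0 Chrome/124.0.0.0", "203.0.113.5"))
```

The last case shows the limit of this kind of check: a request with a generic browser user agent from an unlisted address cannot be attributed on these two fields alone, which is consistent with Cloudflare describing additional heuristics rather than a plain IP list.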

Perplexity, however, has vehemently denied the accusations, dismissing Cloudflare’s report as a “sales pitch.” Jesse Dwyer, a spokesperson for Perplexity, asserted that the bot identified by Cloudflare was not associated with the company and claimed that the screenshots provided by Cloudflare did not demonstrate any actual content access. Perplexity argues that Cloudflare fundamentally misunderstands the operational model of modern AI assistants. The AI startup stated that its platform relies on “user-driven agents” that fetch content only when a user poses a specific question requiring real-time information, emphasizing that this fetched data is neither stored nor used for training AI models. Furthermore, Perplexity accused Cloudflare of misattributing automated traffic from a third-party service, Browserbase, to its systems, calling it a “basic traffic analysis failure.”

This high-profile dispute underscores the growing tension between AI companies, whose products depend on vast amounts of web data, and website operators striving to maintain control over their intellectual property and content distribution. AI tools that rely on Retrieval-Augmented Generation (RAG) have a continuous need for current information, which some publishers view as a “revenue-threatening parasitic relationship.” Ethical considerations surrounding AI data sourcing, transparency in bot behavior, and adherence to web standards like robots.txt are at the forefront of this debate. Cloudflare recently launched its “Content Independence Day” initiative, aimed at empowering over 2.5 million websites to block AI training crawlers and assert greater control over their content. This is not the first time Perplexity has faced scrutiny over its content acquisition practices; previous allegations include plagiarism and bypassing paywalls. The ongoing controversy highlights the challenge of balancing AI innovation with the rights and preferences of web publishers.