Perplexity AI accused of covertly scraping websites, defying no-crawl rules
In a significant escalation of the ongoing battle over AI content scraping, internet infrastructure giant Cloudflare has publicly accused AI search startup Perplexity AI of employing "stealth crawlers" to bypass website restrictions and harvest content without permission. The allegations, detailed in a Cloudflare report released on Monday, August 4, 2025, suggest Perplexity's bots are actively disguising themselves to flout widely accepted web protocols, including robots.txt directives.
Cloudflare's investigation, prompted by complaints from its customers, revealed that even when websites implemented robots.txt files and specific firewall rules to block Perplexity's officially declared crawlers (such as PerplexityBot), content was still being accessed by the AI service. According to Cloudflare, Perplexity's systems appeared to switch to undeclared bots that mimicked legitimate web browser traffic, frequently rotated IP addresses, and altered user agents to evade detection. Cloudflare engineers likened this behavior to "adaptive malware," and Cloudflare CEO Matthew Prince controversially compared the tactics to those used by "North Korean hackers."
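To make the mechanism concrete, here is a minimal sketch in Python of how a well-behaved crawler consults robots.txt before fetching a page. The robots.txt contents and the example.com URL are hypothetical and written only for illustration; they are not taken from Cloudflare's report. The sketch uses the standard library's `urllib.robotparser` and shows why the protocol depends on honest self-identification: the rules are matched against whatever user agent the crawler declares.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site: block the declared
# PerplexityBot crawler everywhere, allow all other user agents.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # placeholder URL
    # A crawler that declares itself is told not to fetch the page.
    print("PerplexityBot:", is_allowed(ROBOTS_TXT, "PerplexityBot", url))
    # The same request under a generic browser-style user agent falls
    # through to the wildcard rule, so robots.txt alone does not stop it.
    print("Mozilla/5.0:", is_allowed(ROBOTS_TXT, "Mozilla/5.0", url))
```

Because the check is voluntary and keyed on the declared user agent, a crawler that presents itself as an ordinary browser is not covered by a rule written for its official name. That gap is what the alleged undeclared bots would exploit, and it is why site operators pair robots.txt with firewall rules.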
Perplexity AI, a search engine backed by investors like Jeff Bezos, synthesizes responses from web content and provides citations, aiming for transparent and factual information retrieval. A spokesperson for Perplexity, Jesse Dwyer, dismissed Cloudflare's claims as misleading, stating that "no content was actually accessed" and suggesting the traffic in question did not originate from the company's systems.
This is not Perplexity AI's first encounter with allegations of aggressive scraping. In June 2024, Forbes publicly criticized the company for allegedly copying an entire article, including illustrations, with minimal attribution. Wired also reported in June 2024 that Perplexity was scraping content from sites that explicitly prohibited such actions and was observed inaccurately paraphrasing articles. Major media organizations have also taken legal steps: The New York Times issued a cease-and-desist notice in October 2024, Dow Jones and the New York Post filed a lawsuit that same month, and the BBC threatened legal action in June 2025, all accusing Perplexity of unauthorized content use and copyright infringement. Perplexity has generally maintained that it "aggregates" public information under what it believes to be fair use, and that it is not training large language models from scratch but rather indexing the web to produce summaries.
In response to the growing issue of AI scraping, Cloudflare has taken proactive measures. The company has delisted Perplexity AI as a "verified bot" and updated its systems to actively block these "stealth crawling" activities. Cloudflare also offers tools that let website owners block AI training crawlers, and in March 2025 it introduced an "AI Labyrinth" feature designed to trap misbehaving bots in a maze of AI-generated junk content, wasting their resources and deterring unauthorized scraping. Cloudflare's CEO has emphasized the need for AI firms to adopt ethical standards, warning that continued evasion could lead to broader blocks.
The dispute underscores a fundamental tension in the AI era: AI developers require vast amounts of data to train their models, while content publishers seek to control and monetize their intellectual property. While robots.txt has long served as a voluntary protocol for web crawlers, the ethical and legal implications of ignoring these directives for AI training and content generation remain a hotly debated topic, potentially accelerating calls for industry regulation and new legal frameworks.