Cloudflare vs. Perplexity: AI Web Scraping War & Legal Risks
Internet infrastructure giant Cloudflare and AI startup Perplexity are locked in a public dispute over allegations of illicit web scraping. This “web scraping war” has significant implications for the future of artificial intelligence development, content monetization, and the ethics of data acquisition.
Cloudflare initiated the public discussion on August 4, 2025, with a blog post accusing Perplexity, an AI-powered “answer engine,” of bypassing robots.txt restrictions to scrape content. The robots.txt convention is a long-standing web standard, introduced in 1994 and formally standardized in 2022 (as RFC 9309), that lets websites signal whether they want their content accessed by search engines or AI crawlers. Cloudflare alleges that Perplexity initially crawls with its declared user agents (such as PerplexityBot) but, when blocked, resorts to “stealth crawling”: obscuring its identity, modifying user agents, changing source IP addresses, and sometimes not fetching robots.txt files at all. This behavior, Cloudflare claims, is incompatible with the established web “netiquette” and ethical standards that have historically governed internet interactions. Cloudflare’s investigation was prompted by numerous complaints from customers who had explicitly disallowed Perplexity’s crawling activity in their robots.txt files and implemented Web Application Firewall (WAF) rules, yet still found their content being accessed by Perplexity. Cloudflare has since de-listed Perplexity as a “verified bot” and implemented new rules to block its stealth crawling.
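To make the mechanism concrete, the following is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the `example.com` URLs, the `ExampleBrowser` agent name, and the `is_allowed` helper are all hypothetical and chosen only for illustration; this is not Cloudflare's or Perplexity's actual code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site: it blocks the declared
# PerplexityBot user agent everywhere and blocks /private/ for everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("PerplexityBot", "https://example.com/article"),
        ("ExampleBrowser", "https://example.com/article"),
        ("ExampleBrowser", "https://example.com/private/report"),
    ]
    for agent, url in checks:
        print(f"{agent:15} {url:40} allowed={is_allowed(ROBOTS_TXT, agent, url)}")
```

The check is purely advisory and depends on the user-agent string the client chooses to declare. In this sketch, a request that identifies itself as a generic browser falls under the `*` rule rather than the `PerplexityBot` rule, which is why the alleged switch to undeclared user agents would sidestep a site's stated preference.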
Perplexity has vehemently denied Cloudflare’s accusations, calling their analysis “embarrassing” and “disqualifying.” Perplexity argues that Cloudflare’s systems are “fundamentally inadequate for distinguishing between legitimate AI assistants and actual threats.” The AI startup asserts that its system operates fundamentally differently from traditional web crawlers; instead of systematically indexing vast portions of the web, it fetches webpages only in response to specific user questions, acting as a “user-triggered agent.” Perplexity claims it does not store or index content ahead of time and does not retain or use the fetched content for training its models.
This dispute is not an isolated incident for Perplexity. The company is already embroiled in legal battles with major publishers. In October 2024, Dow Jones (publisher of The Wall Street Journal) and NYP Holdings (publisher of the New York Post), both News Corp subsidiaries, filed a lawsuit against Perplexity, alleging “massive scale” copyright infringement by copying their content to build its retrieval-augmented generation (RAG) index. The lawsuit claims this practice allows Perplexity users to “skip the links” and directly access summaries, thereby reducing traffic and revenue for publishers. Similarly, the BBC sent a letter to Perplexity in June 2025, threatening legal action for scraping its content without permission and demanding compensation or deletion of data. The BBC claims to have evidence that Perplexity’s model was trained using its content and that parts of its content were reproduced verbatim, directly competing with its services. Perplexity, in turn, labeled the BBC’s claims as “manipulative and opportunistic” and indicative of a “fundamental misunderstanding” of technology and intellectual property law. Despite these legal challenges, Perplexity has also entered into revenue-sharing deals with some publishers, including Time, Fortune, and Der Spiegel, in an attempt to address content concerns.
The broader implications of this “web scraping war” are significant for the evolving relationship between AI developers and content creators. The rise of AI crawlers that summarize content without generating direct traffic or revenue for publishers threatens the web’s dominant business model. Cybersecurity researchers anticipate an escalating “arms race” between those protecting content and AI companies seeking data. While the legal limits of scraping content and bypassing robots.txt remain unclear, Cloudflare’s findings could expose Perplexity to further lawsuits. This ongoing conflict underscores the urgent need for clear ethical guidelines and potentially new legal frameworks to govern how AI systems access and utilize online data, balancing innovation with the rights of content creators.