Cloudflare Accuses AI Firm Perplexity of 'Stealth Crawling'

2025-08-05T08:39:39.000Z · TechRepublic

Cloudflare, a prominent internet infrastructure provider, has publicly accused AI startup Perplexity of engaging in "stealth crawling behavior" across millions of websites, reigniting a contentious debate over how AI firms access and utilize web content. The accusation, detailed in a recent Cloudflare blog post, alleges that Perplexity's bots bypass established website restrictions, including robots.txt files and firewall rules, to scrape content.

According to Cloudflare, Perplexity's crawlers initially use declared user agents, but when faced with network blocks or robots.txt disallow directives, they allegedly switch to undeclared, generic browser signatures and rotate IP addresses to evade detection. Cloudflare says it observed this behavior across tens of thousands of domains, amounting to millions of requests per day, and used machine learning and network signals to fingerprint the stealthy activity, including instances where the bots impersonated popular web browsers such as Google Chrome on macOS. The investigation began after customers complained that Perplexity was still accessing their content despite explicit blocks being in place.

The robots.txt file is a widely adopted web standard that provides instructions to web robots, such as search engine crawlers, about which parts of a website they are permitted to access. Cloudflare asserts that Perplexity's actions are in direct conflict with these web crawling norms, which emphasize transparency and adherence to website directives. As a result, Cloudflare has de-listed Perplexity as a verified bot and updated its rules to block such stealth activity, offering its customers enhanced protection against these undeclared crawlers.
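The convention works as follows: a compliant crawler fetches the site's robots.txt, finds the group of rules matching its declared user agent, and checks each URL against them before requesting it. A minimal sketch using Python's standard-library parser, with hypothetical rules and the "PerplexityBot" user-agent string used purely for illustration:

```python
from urllib import robotparser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
RULES = """
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# A compliant crawler checks permission before fetching a URL.
print(rp.can_fetch("PerplexityBot", "https://example.com/article"))  # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))   # True
```

The key point of the dispute is that this check is entirely voluntary: nothing technically prevents a crawler from skipping it, which is why Cloudflare pairs robots.txt with firewall rules and bot fingerprinting.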

In response to Cloudflare's allegations, Perplexity has pushed back strongly, characterizing Cloudflare's leadership as "either dangerously misinformed on the basics of AI, or simply more flair than cloud." Perplexity argued in a post that its AI agents operate differently from traditional web crawlers: when a user asks a question requiring current information, its AI visits relevant websites, reads the content, and produces a tailored summary. The company emphasizes that this content is not stored for training purposes but is used immediately to answer the user's query. Perplexity also suggested that Cloudflare may be conflating its legitimate traffic with unrelated requests from third-party services such as BrowserBase.

This dispute highlights a growing tension within the digital ecosystem, where AI companies require vast amounts of data for their models, while content creators and publishers seek to control how their intellectual property is accessed and monetized. The effectiveness of robots.txt as a voluntary protocol is increasingly being questioned in the age of AI, leading to calls for more robust mechanisms for content owners to express their preferences regarding AI data usage. Cloudflare's recent "Content Independence Day" initiative, which allows over 2.5 million websites to block AI training crawlers, underscores the industry's shift towards providing greater control to content creators.

The incident involving Perplexity is not isolated: other AI firms, such as Anthropic, face similar accusations and legal challenges, including a lawsuit from Reddit over content scraping. While some AI companies, such as OpenAI, are reportedly adhering to best practices and proposed standards for bot behavior, the current controversy underscores the ongoing need for clear ethical guidelines and technical solutions to manage AI-driven web crawling responsibly.
