Major internet infrastructure provider blocks AI search engine over alleged deceptive crawling tactics that circumvent website protections
August 16, 2025 — A major controversy has erupted in the AI and web infrastructure world as Cloudflare, one of the internet’s largest security and content delivery networks, publicly accused AI search engine Perplexity of using “stealth crawling” techniques to bypass explicit website restrictions. The dispute has escalated into a public war of words, with implications that could reshape how AI companies access web content and how websites protect their data from automated harvesting.
The Core Allegations: Sophisticated Evasion Tactics
Cloudflare, which protects an estimated 24 million websites, published a damning report on August 4, 2025, detailing what it characterized as systematic deception by Perplexity. The allegations center on Perplexity’s alleged use of undeclared crawlers that masquerade as regular browser traffic to access content from websites that have explicitly blocked its official bots.
According to Cloudflare’s investigation, Perplexity repeatedly modified its user agent and rotated through IP addresses and ASNs to hide its crawling activity, in direct conflict with explicit no-crawl preferences expressed by websites. The company’s analysis revealed that when Perplexity’s declared crawlers encounter robots.txt restrictions or network blocks, the AI search engine allegedly switches to these deceptive tactics.
The Technical Evidence: A Web of Deception
Cloudflare’s engineers documented specific technical behaviors that they characterized as violations of web crawling norms:
User Agent Spoofing
When blocked, Perplexity allegedly switches to a generic browser user agent intended to impersonate Google Chrome on macOS. The specific user agent identified was:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
This string is designed to blend in with normal human traffic, making it nearly impossible for standard filtering systems to distinguish the crawler from legitimate users.
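To illustrate why that matters, here is a minimal sketch (not Cloudflare’s actual detection logic) of a naive user-agent filter that blocks only declared crawler tokens; the token list and example strings are illustrative. A spoofed Chrome string passes it untouched.

```python
# Minimal sketch of a naive user-agent filter; the token list is illustrative.
DECLARED_BOT_TOKENS = ["PerplexityBot", "Perplexity-User", "GPTBot"]

SPOOFED_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/124.0.0.0 Safari/537.36")

def naive_ua_filter(user_agent: str) -> str:
    """Block a request only if its user agent contains a known bot token."""
    if any(token.lower() in user_agent.lower() for token in DECLARED_BOT_TOKENS):
        return "block"
    return "allow"

print(naive_ua_filter("Mozilla/5.0 (compatible; PerplexityBot/1.0)"))  # block
print(naive_ua_filter(SPOOFED_UA))                                     # allow
```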
IP Address Rotation and ASN Switching
Cloudflare observed that Perplexity’s stealth crawlers operated outside of the IP addresses in Perplexity’s official IP range, using addresses that came from different ASNs (Autonomous System Numbers) to evade address-based blocking. This activity was observed across tens of thousands of domains and millions of requests per day.
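For context, declared crawlers typically operate from published IP ranges, which lets site owners validate who is really knocking. The sketch below shows such a check with Python’s ipaddress module; the ranges are placeholder documentation networks, not Perplexity’s actual ranges.

```python
import ipaddress

# Placeholder ranges standing in for a crawler operator's published networks;
# real operators document their ranges, and verified-bot programs validate them.
PUBLISHED_CRAWLER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),     # TEST-NET-1 (placeholder)
    ipaddress.ip_network("198.51.100.0/24"),  # TEST-NET-2 (placeholder)
]

def is_from_published_range(client_ip: str) -> bool:
    """Return True if the client address sits inside a declared crawler range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in PUBLISHED_CRAWLER_RANGES)

print(is_from_published_range("192.0.2.10"))    # True: consistent with declared ranges
print(is_from_published_range("203.0.113.55"))  # False: outside every declared range
```

Rotating through addresses in unrelated ASNs defeats exactly this kind of check, which is the core of Cloudflare’s complaint.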
Robots.txt Violations
Most damaging to Perplexity’s reputation, Cloudflare found evidence that the company was ignoring—or sometimes failing to even fetch—robots.txt files, the web standard that tells crawlers which content they should and shouldn’t access.
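By contrast, compliant crawling looks roughly like the sketch below: fetch and parse robots.txt first, then skip anything it disallows. The file contents and domain are hypothetical, and the standard-library robotparser stands in for whatever a production crawler would actually use.

```python
from urllib import robotparser

# Hypothetical robots.txt for an example site that bars one named AI crawler
# while leaving the content open to everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler consults the parsed rules before every fetch.
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))  # False
print(parser.can_fetch("Mozilla/5.0", "https://example.com/article"))    # True
```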
The Smoking Gun: Controlled Experiments
To validate their suspicions, Cloudflare conducted sophisticated honeytrap experiments that appear to provide compelling evidence of misconduct:
Test Setup:
- Created multiple brand-new domains (similar to testexample.com and secretexample.com)
- Domains were newly purchased and had never been indexed by any search engine
- Made content not publicly accessible in any discoverable way
- Implemented robots.txt files with directives to stop all automated access (a minimal example of such a file follows this list)
- Added specific WAF rules blocking Perplexity’s declared crawlers
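A minimal sketch of the kind of lockdown robots.txt described in the setup above (Cloudflare did not publish the exact files it served): a single wildcard rule telling every crawler to fetch nothing.

```
# Lockdown robots.txt: no crawler may fetch any path.
User-agent: *
Disallow: /
```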
Results: Despite these protections, when Cloudflare queried Perplexity AI about these secret domains, the platform provided detailed information about the restricted content. As Cloudflare noted: “This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.”
The Control Test: OpenAI’s Compliance
To establish a baseline for proper behavior, Cloudflare ran identical tests with ChatGPT. The results were starkly different:
- ChatGPT-User fetched the robots.txt file and stopped crawling when disallowed
- No follow-up crawls from other user agents or third-party bots were observed
- When presented with block pages, ChatGPT stopped crawling entirely
- No additional crawl attempts from alternative user agents occurred
This comparison demonstrated what Cloudflare called “the appropriate response to website owner preferences.”
Perplexity’s Fierce Counter-Attack
Perplexity responded with unprecedented aggression, publishing a blog post that accused Cloudflare of either technical incompetence or deliberate misrepresentation:
“Cloudflare’s recent blog post managed to get almost everything wrong about how modern AI assistants actually work. The technical errors in Cloudflare’s analysis aren’t just embarrassing — they’re disqualifying.”
Key Elements of Perplexity’s Defense:
1. Attribution Error Claims: Perplexity alleged that Cloudflare “fundamentally misattributed 3-6M daily requests from BrowserBase’s automated browser service to Perplexity,” describing this as “a basic traffic analysis failure that’s particularly embarrassing for a company whose core business is understanding and categorizing web traffic.”
2. Methodology Criticism: The company accused Cloudflare of obfuscating its methodology and declining to answer questions that would have helped Perplexity’s teams understand the allegations.
3. Publicity Stunt Allegations: “Cloudflare needed a clever publicity moment and we—their own customer—happened to be a useful name to get them one.”
4. Technical Distinction Argument: Perplexity claimed that modern AI assistants work fundamentally differently from traditional web crawlers, arguing that user-driven requests should be treated differently from automated scraping.
Industry Expert Reactions: A Divided Community
The controversy has split the tech community, with experts and observers taking sides in what some are calling the most significant web ethics dispute of 2025.
Defenders of Traditional Web Standards
Cloudflare CEO Matthew Prince didn’t mince words, posting on social media: “Some supposedly ‘reputable’ AI companies act more like North Korean hackers. Time to name, shame, and hard block them.”
Web infrastructure experts emphasize that the controversy strikes at the heart of web governance:
“The Internet as we have known it for the past three decades is rapidly changing, but one thing remains constant: it is built on trust. There are clear preferences that crawlers should be transparent, serve a clear purpose, perform a specific activity, and, most importantly, follow website directives and preferences.”
Defenders of AI Innovation
However, many came to Perplexity’s defense, arguing that the distinction between user-driven AI assistants and traditional bots is significant:
“If I as a human request a website, then I should be shown the content. Why would the LLM accessing the website on my behalf be in a different legal category as my Firefox web browser?”
Another supporter argued: “The difference between automated crawling and user-driven fetching isn’t just technical — it’s about who gets to access information on the open web.”
The Broader Context: AI’s Data Hunger Crisis
This controversy emerges amid a broader crisis over AI companies’ voracious appetite for data and publishers’ attempts to protect their content:
Growing Bot Traffic
Recent studies reveal the scope of the challenge:
- In March 2025, 26 million AI scrapes bypassed robots.txt files
- RAG bot scrapes per site grew 49% from Q4 2024 to Q1 2025
- The share of bots ignoring robots.txt files increased from 3.3% to 12.9% during Q1 2025
- Around 30% of global web traffic in July came from bots
Revenue Threat to Publishers
The fundamental issue is what experts call “a revenue-threatening parasitic relationship” where AI tools access content without generating traffic or ad revenue for publishers. As one expert noted: “AI platforms typically don’t want to point to your site, they want to replicate it so users never have to go there.”
Cloudflare’s Nuclear Response: Complete Blocking
In response to their findings, Cloudflare took unprecedented action:
- De-listing: Removed Perplexity from its “verified bot” program
- Active Blocking: Added heuristics to managed rules that block Perplexity’s stealth crawling
- Protective Measures: Enhanced bot detection capabilities to identify similar behaviors
The move affects millions of websites protected by Cloudflare’s services, potentially cutting off Perplexity’s access to a significant portion of the web.
Technical Solutions and Workarounds
The controversy has accelerated development of new technical standards and solutions:
Web Bot Auth Standard
Cloudflare is promoting Web Bot Auth, a proposed standard being developed through the Internet Engineering Task Force (IETF), which aims to create a cryptographic method for identifying AI agent web requests. According to Cloudflare, OpenAI has already begun implementing the proposal, while Perplexity has not.
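The underlying idea is that a crawler proves its identity cryptographically instead of relying on spoofable user-agent strings and IP lists. The sketch below is a loose illustration using Ed25519 signatures from the cryptography package; the actual proposal builds on HTTP Message Signatures (RFC 9421), and its header names, signature base, and key-discovery mechanism come from the draft, not from this code.

```python
from cryptography.hazmat.primitives.asymmetric import ed25519

# The bot operator holds a long-lived keypair and publishes the public key
# where site owners (or a CDN) can discover it.
private_key = ed25519.Ed25519PrivateKey.generate()
public_key = private_key.public_key()

# For each request, the crawler signs selected request material so the
# receiving site can verify who is actually asking for the page.
request_material = b"GET /article\nhost: example.com\nuser-agent: ExampleBot/1.0"
signature = private_key.sign(request_material)

# Verification fails (raises InvalidSignature) for anyone who merely copies
# the user-agent string but lacks the operator's private key.
public_key.verify(signature, request_material)
print("signature verified")
```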
Defensive Technologies
Cloudflare has developed several new tools to help publishers control AI access:
- AI Labyrinth: A system designed to trap non-compliant bots
- Pay Per Crawl: Allows sites to charge for access
- Enhanced Bot Detection: Machine learning systems to identify disguised crawlers
Essential Strategies for Website Owners
Organizations must now navigate this complex landscape with specific technical approaches:
1. Implement Multi-Layered Bot Protection
- Use robust robots.txt files with explicit disallow directives (an illustrative file follows this list)
- Deploy WAF rules targeting specific AI crawlers
- Monitor traffic patterns for suspicious user agents
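For the robots.txt layer, here is an illustrative file that disallows several widely documented AI crawler tokens. These names change over time, so check each vendor’s current documentation, and remember that robots.txt is purely advisory, which is exactly why it needs to be paired with WAF rules and monitoring.

```
# Illustrative robots.txt denying well-known AI crawlers by declared token.
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```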
2. Consider Cloudflare’s Enhanced Protection
- Enable AI crawler blocking features
- Use challenge-based systems to verify human users
- Implement pay-per-crawl models if appropriate
3. Monitor and Analyze Bot Traffic
- Track crawler visits using server logs (a log-scanning sketch follows this list)
- Identify patterns that suggest evasion techniques
- Use fingerprinting techniques to identify disguised bots
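A minimal log-scanning sketch for that monitoring step, assuming a combined-format access log at a hypothetical path access.log; the request threshold and the “claims to be Chrome” heuristic are illustrative starting points, not a substitute for proper fingerprinting.

```python
import re
from collections import Counter

# A minimal sketch: scan a combined-format access log (hypothetical path)
# and flag addresses that claim a desktop Chrome user agent but generate
# bot-scale request volumes. Thresholds here are illustrative.
LOG_PATH = "access.log"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \S+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

requests_per_ip = Counter()
browser_claiming_ips = set()

with open(LOG_PATH) as log:
    for line in log:
        match = LINE_RE.match(line)
        if not match:
            continue
        ip, ua = match.group("ip"), match.group("ua")
        requests_per_ip[ip] += 1
        if "Chrome/" in ua and "bot" not in ua.lower():
            browser_claiming_ips.add(ip)

# Addresses that look like a browser yet hammer the site deserve a closer
# look: reverse DNS, ASN lookup, TLS fingerprinting, rate limiting.
for ip, count in requests_per_ip.most_common(20):
    if ip in browser_claiming_ips and count > 1000:
        print(f"suspicious: {ip} sent {count} browser-like requests")
```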
4. Legal and Ethical Considerations
- Clearly document crawling preferences in robots.txt
- Consider legal action for persistent violations
- Evaluate fair use implications of AI content usage
The Future: Establishing New Norms
This controversy is likely to establish precedents for how AI companies access web content and how website owners protect their intellectual property.
Potential Outcomes:
- Regulatory Intervention: Government agencies may step in to establish binding standards for AI crawler behavior
- Industry Self-Regulation: Tech companies may develop voluntary standards to avoid regulatory oversight
- Technical Arms Race: Continued escalation between protection and evasion technologies
- Business Model Evolution: New revenue-sharing models between AI companies and content creators
My Analysis: The Cloudflare-Perplexity controversy represents a watershed moment in the evolution of web governance. While Perplexity raises valid questions about the distinction between user-driven requests and automated crawling, Cloudflare’s evidence of systematic evasion tactics is difficult to dismiss. The technical sophistication of the alleged deception—including user agent spoofing, IP rotation, and ASN switching—suggests behavior that goes well beyond legitimate user representation.
The core issue isn’t whether AI should access web content, but whether it should do so transparently and with respect for website owners’ preferences. Perplexity’s alleged tactics undermine the trust-based system that has governed web crawling for decades, potentially setting a dangerous precedent for other AI companies.
Conclusion: The Battle for Web Governance
The Cloudflare-Perplexity dispute is more than a technical disagreement—it’s a fundamental battle over the future governance of the web. As AI companies’ hunger for data grows, the traditional systems that governed crawler behavior are being tested to their limits.
The outcome of this controversy will likely determine whether the web remains an ecosystem of mutual cooperation or devolves into a technological arms race between content protectors and data harvesters. For website owners, the message is clear: the old assumptions about crawler behavior no longer hold, and new defensive strategies are essential.
The broader question remains: Can the web maintain its foundation of trust in an era where AI’s commercial incentives may conflict with traditional norms of respect and transparency? The answer will shape the internet’s future for years to come.
As this controversy continues to unfold, it serves as a stark reminder that the rapid advancement of AI technology often outpaces the development of ethical frameworks and governance structures needed to manage its impact on existing digital ecosystems.