How to Block AI Scrapers Without Hurting SEO
The web has changed. It used to be that you wanted every bot to index your site. You wanted Google, Bing, and Yahoo to see your content.
In 2025, a new wave of bots from OpenAI, Anthropic, and Common Crawl is hitting servers not to send you traffic, but to scrape your data for model training. They eat your bandwidth, skew your analytics, and provide zero ROI.
Blocking them is tricky. If you block too aggressively, you might accidentally block Googlebot and kill your SEO. If you rely on User-Agent strings, you're fighting a losing battle because bots lie.
The "User-Agent" Lie
Most developers try to block AI bots like this:
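// Naive check: trusts whatever User-Agent string the client sends
if (userAgent.includes("GPTBot")) {
  return 403; // reject with HTTP 403 Forbidden
}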
This works for the "polite" bots. But aggressive scrapers simply spoof their User-Agent to look like Chrome or iPhone users. Relying on a string of text provided by the visitor is like asking a burglar if they are a burglar.
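To make the weakness concrete, here is a minimal sketch of the same check wrapped in a function. The helper name isBlockedByUserAgent and both sample User-Agent strings are illustrative assumptions, not taken from any library or real traffic log.

```javascript
// Hypothetical helper that mirrors the naive check above.
function isBlockedByUserAgent(userAgent) {
  return (userAgent ?? "").includes("GPTBot");
}

// A polite bot that announces itself is caught.
const honest = "Mozilla/5.0 (compatible; GPTBot/1.0)";

// A scraper that copies a desktop Chrome string is not:
// the header is entirely under the client's control.
const spoofed =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

console.log(isBlockedByUserAgent(honest));  // true
console.log(isBlockedByUserAgent(spoofed)); // false
```

The second call returns false even though the request could come from the very same scraper, which is why the string alone cannot be the basis of a blocking decision.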
The Solution: ASN Intelligence
AI scrapers have a weakness: they run on data center infrastructure. They typically operate from AWS, Google Cloud, Azure, or specialized hosting providers.
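Real users typically arrive from residential or mobile ISP networks, and the real Googlebot arrives from Google's own network, so both have distinct network signatures. By looking at the ASN (Autonomous System Number) and the company behind the IP, we can filter the scrapers out with high precision.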
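As a rough illustration of what an ASN check looks like, the sketch below maps a handful of ASNs to their operators. The list is deliberately tiny and is an assumption for illustration only; a real deployment would take the ASN and company data from a maintained IP database. The helper name hostingProviderFor is hypothetical.

```javascript
// Illustrative, incomplete set of ASNs operated by large cloud/hosting providers.
// Assumption: check these against a current registry before relying on them.
const HOSTING_ASNS = new Map([
  [16509, "Amazon (AWS)"],
  [15169, "Google"],
  [8075, "Microsoft (Azure)"],
  [14061, "DigitalOcean"],
]);

// Hypothetical helper: returns the provider name for a known hosting ASN, else null.
function hostingProviderFor(asn) {
  return HOSTING_ASNS.get(asn) ?? null;
}

console.log(hostingProviderFor(16509)); // "Amazon (AWS)"
console.log(hostingProviderFor(7922));  // null (not in the list above)
```

Note that Googlebot also originates from Google's ASN, so an ASN match on its own is a reason to flag a request, not to block it.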
How to Implement "Safe" Blocking
Using CandycornDB, you can implement a check that blocks data center traffic unless it is a verified search engine crawler.
The Logic
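- Step 1: Check company.type. If it is "hosting", flag it.
- Step 2: Check risk_score. High-volume scrapers often trigger subnet alerts.
- Step 3 (Crucial): Whitelist legitimate crawlers by verifying that their reverse DNS (rDNS) hostname ends in googlebot.com (Googlebot) or search.msn.com (Bingbot, whose crawler hostnames sit under this domain rather than bing.com), and that the hostname resolves back to the same IP.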
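The three steps above can be sketched as a single decision function. This is a minimal sketch, not CandycornDB's actual API: the record shape, the "isp" type value, the 0-100 score scale, and the threshold are all assumptions, and only the field names company.type and risk_score come from the steps themselves.

```javascript
// Hypothetical cutoff on an assumed 0-100 risk scale.
const RISK_THRESHOLD = 70;

// `record` is an assumed IP lookup result: { company: { type }, risk_score }.
// `isVerifiedCrawler` is the outcome of a separate rDNS verification (Step 3).
function decide(record, isVerifiedCrawler) {
  // Step 3 takes priority: verified search engine crawlers are always allowed.
  if (isVerifiedCrawler) return "allow";

  // Step 1: flag traffic that originates from a hosting provider.
  const isHosting = record?.company?.type === "hosting";

  // Step 2: flag IPs with a high risk score.
  const isRisky = (record?.risk_score ?? 0) >= RISK_THRESHOLD;

  return isHosting || isRisky ? "block" : "allow";
}

console.log(decide({ company: { type: "hosting" }, risk_score: 10 }, false)); // "block"
console.log(decide({ company: { type: "hosting" }, risk_score: 10 }, true));  // "allow"
console.log(decide({ company: { type: "isp" }, risk_score: 5 }, false));      // "allow"
```

Checking the whitelist first is a deliberate choice here: search engine crawlers also run on data center networks, so evaluating Step 1 before Step 3 would block them.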
This method allows you to block the "GPTBots" and generic scrapers that hide in AWS/GCP, while safely letting Google and real users through to your site.
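The rDNS verification is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's domain, then resolve that hostname back and confirm it returns the same IP. Below is a minimal Node.js sketch of that idea. The function name isVerifiedCrawler and the injectable resolver parameter are hypothetical, the suffix list is an assumption that should be checked against each search engine's current documentation, and the forward lookup covers IPv4 only for brevity.

```javascript
const dns = require("node:dns").promises;

// Assumed hostname suffixes for crawlers we want to let through.
const CRAWLER_SUFFIXES = [".googlebot.com", ".search.msn.com"];

// Forward-confirmed reverse DNS check.
// `resolver` defaults to Node's promise-based DNS API and can be swapped for a stub in tests.
async function isVerifiedCrawler(ip, resolver = dns) {
  try {
    // 1. Reverse lookup: IP -> hostname(s).
    const hostnames = await resolver.reverse(ip);
    for (const host of hostnames) {
      const name = host.toLowerCase();
      // 2. The hostname must belong to a known crawler domain.
      if (!CRAWLER_SUFFIXES.some((suffix) => name.endsWith(suffix))) continue;
      // 3. Forward lookup: the hostname must resolve back to the same IP.
      const addresses = await resolver.resolve4(name);
      if (addresses.includes(ip)) return true;
    }
  } catch {
    // Any DNS failure is treated as "not verified".
  }
  return false;
}

// Usage (performs real DNS queries):
// isVerifiedCrawler("66.249.66.1").then(console.log);
```

The forward lookup matters because anyone who controls the reverse DNS zone for their own IP range can publish a PTR record that claims to be under googlebot.com; only the real domain owner can make that hostname resolve back to the IP. DNS lookups are slow relative to a request, so results would normally be cached per IP.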
Final Thoughts
Your data is your moat. Don't let AI companies scrape it for free. By moving beyond User-Agents and looking at the network infrastructure, you can protect your content without disappearing from search results.