
How to Block AI Scrapers Without Hurting SEO

Updated December 2025 · Security Team

The web has changed. It used to be that you wanted every bot to index your site. You wanted Google, Bing, and Yahoo to see your content.

In 2025, a new wave of bots from OpenAI, Anthropic, and CommonCrawl is hitting servers not to send you traffic, but to scrape your data for model training. They eat your bandwidth, skew your analytics, and provide zero ROI.

Blocking them is tricky. If you block too aggressively, you might accidentally block Googlebot and kill your SEO. If you rely on User-Agent strings, you're fighting a losing battle because bots lie.

The "User-Agent" Lie

Most developers try to block AI bots like this:

// Naive check: trust whatever User-Agent string the visitor sends us
const userAgent = request.headers.get("user-agent") || "";
if (userAgent.includes("GPTBot")) {
  return new Response("Forbidden", { status: 403 });
}

This works for the "polite" bots. But aggressive scrapers simply spoof their User-Agent to look like Chrome or iPhone users. Relying on a string of text provided by the visitor is like asking a burglar if they are a burglar.
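
To see why this breaks down, here is a minimal sketch of the other side of the exchange: a scraper requesting a page while claiming to be Safari on an iPhone. The URL and User-Agent string are illustrative.

// The User-Agent header is entirely under the client's control, so the
// check above never fires for a scraper that bothers to spoof it.
const res = await fetch("https://example.com/articles/pricing-guide", {
  headers: {
    "User-Agent":
      "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 " +
      "(KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1",
  },
});
const html = await res.text(); // your content, scraped by "an iPhone"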

The Solution: ASN Intelligence

AI scrapers have a weakness: they run on data center infrastructure. They typically operate from AWS, Google Cloud, Azure, or specialized hosting providers.

Real users (and the real Googlebot) have distinct network signatures. By looking at the ASN (Autonomous System Number) and the company behind the IP, we can filter scrapers out with high precision.

Identifying the Imposter

Both visitors claim "I am an iPhone." The network data tells them apart:

  • BLOCKED (User-Agent spoofed): IP 35.196.x.x, ASN AS15169, Org: Google Cloud, Type: Hosting
  • ALLOWED (real device): IP 174.202.x.x, ASN AS10507, Org: Verizon Wireless, Type: ISP
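
In code, that distinction shows up the moment you look the IP up. The objects below are illustrative only; the field names mirror the company.type and risk_score fields used in the next section, but the exact shape of a CandycornDB response is an assumption here.

// Illustrative lookup results for the two visitors above. The network values
// come from the comparison; the risk_score numbers are made up.
const dataCenterVisitor = {
  ip: "35.196.x.x",
  asn: "AS15169",
  company: { name: "Google Cloud", type: "hosting" },
  risk_score: 82,
};

const residentialVisitor = {
  ip: "174.202.x.x",
  asn: "AS10507",
  company: { name: "Verizon Wireless", type: "isp" },
  risk_score: 3,
};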

How to Implement "Safe" Blocking

Using CandycornDB, you can implement a check that blocks data center traffic unless it is a verified search engine crawler.

The Logic

  • Step 1: Check company.type. If it is "hosting", flag it.
  • Step 2: Check risk_score. High-volume scrapers often trigger subnet alerts.
  • Step 3 (Crucial): Whitelist legitimate crawlers by verifying that their reverse DNS (rDNS) resolves to googlebot.com (Google) or search.msn.com (Bing), and that the hostname resolves back to the same IP (forward-confirmed rDNS).

This method allows you to block the "GPTBots" and generic scrapers that hide in AWS/GCP, while safely letting Google and real users through to your site.
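
Here is a minimal sketch of that logic in Node.js, assuming the lookup result carries the company.type and risk_score fields described above. The field names, the risk threshold, and the idea of passing the lookup result in as info are assumptions for illustration, not documented CandycornDB behavior.

import { reverse, lookup } from "node:dns/promises";

// Step 3: forward-confirmed reverse DNS for legitimate search crawlers.
async function isVerifiedCrawler(ip) {
  try {
    const [hostname] = await reverse(ip);
    const trusted = [".googlebot.com", ".google.com", ".search.msn.com"];
    if (!trusted.some((suffix) => hostname.endsWith(suffix))) return false;
    const { address } = await lookup(hostname); // forward-confirm the hostname
    return address === ip;                      // must resolve back to the same IP
  } catch {
    return false; // no rDNS record: not a verified crawler
  }
}

// `info` is the ASN/company lookup result for the visitor's IP.
async function shouldBlock(ip, info) {
  const isHosting = info.company?.type === "hosting"; // Step 1: data center origin
  const isRisky = (info.risk_score ?? 0) >= 70;       // Step 2: illustrative threshold
  if (!isHosting && !isRisky) return false;           // residential / ISP traffic passes
  return !(await isVerifiedCrawler(ip));              // Step 3: spare verified crawlers
}

A cautious rollout is to log the verdict for a week or two before enforcing it, so a misclassified crawler shows up in your logs rather than in your search rankings.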

Final Thoughts

Your data is your moat. Don't let AI companies scrape it for free. By moving beyond User-Agents and looking at the network infrastructure, you can protect your content without disappearing from search results.