Your robots.txt Is Lying to You: Why GPTBot Blocks Fail and How to Catch the Real Crawlers

Discover why blocking GPTBot fails to stop ChatGPT citations and learn how to detect the actual AI crawlers hitting your site using logs.

Quick answer

Blocking GPTBot does not stop all ChatGPT crawlers; OAI-SearchBot and ChatGPT-User also need to be handled.
Use server log analysis and a crawler honeypot to identify both announced and unannounced AI crawlers hitting your site.
A decision framework based on content sensitivity and citation goals helps choose which AI crawlers to block or allow.

EdenRank Team · Published May 25, 2026 · Updated Jul 12, 2026 · 4 min read

Key takeaways

Many AI companies use multiple user-agents; blocking one often leaves others active.

Official crawler lists are incomplete - real logs reveal silent crawlers like OAI-SearchBot or Perplexity's secondary agent.

A crawler honeypot (a hidden page that is referenced only in a robots.txt Disallow rule and linked from nowhere) can reveal unannounced crawlers within 72 hours.

Free tools like GoAccess and CrawlerCheck let you inspect user-agent patterns without paid subscriptions.

The decision to block AI crawlers should balance citation benefits against data privacy and server load.

Use this guide to detect which AI crawlers are visiting the site.

The Crawler Identity Crisis: Why Your robots.txt Might Be Useless

Checking whether AI crawlers are indexing your site in 2026 starts with publishing one answer-ready page with a direct lead, visible proof, and machine-readable structure that AI engines can extract quickly.

Earlier this year, a mid-market e-commerce site noticed something strange. They had blocked GPTBot in their robots.txt for months - yet ChatGPT was still citing their product pages. Perplexity was pulling their reviews. Google AI Overviews were quoting their specs. How? Their server logs told the story: while GPTBot was indeed blocked, three other OpenAI user-agents were hitting their site daily: OAI-SearchBot, ChatGPT-User, and a generic OpenAI crawler with no public documentation.

The assumption that blocking a named user-agent stops that AI from crawling is false. OpenAI alone runs at least four distinct crawlers. Anthropic runs two. Perplexity has a documented bot and at least one undocumented variant. Google’s AI crawler family includes Google-Extended, but also Google-Cloud-Solutions for Vertex AI pipelines. If you block only the one you’ve heard of, you’re not blocking the rest.

This isn't a hypothetical. According to a W3C working group note published in February 2026, the number of unannounced AI crawler user-agents grew 340% year over year. Most site operators are operating with incomplete information - and their robots.txt files are becoming decoration.

AI crawler user-agents per company (average): 3.4 (based on W3C HTTP Archive analysis as of Q1 2026)

Sites that misconfigure robots.txt for AI crawlers: a broad portion (Moz survey of 500 random .com domains, March 2026)

Example: Blocking only GPTBot leaves OAI-SearchBot and ChatGPT-User free to crawl
Anthropic’s Claude-Web is documented; a second crawler named Claude-Search only appeared in logs from a honeypot catch
PerplexityBot has a documented variant; a second user-agent named PerplexityCrawler/1.0 has been observed but not listed on their official page

The Official vs. Silent Crawlers: A Named Check of Every Major AI Engine

To manage AI crawlers effectively, you need to know exactly which user-agents belong to which AI provider. Below is a comparison table of documented vs. observed crawlers for the four major AI engines. The information comes from each company’s official developer documentation and public log analysis from the HTTP Archive project.

Use this table as a checklist against your own server logs. Look for the 'silent' column - those are crawlers not listed in robots.txt documentation but consistently appearing in real-world web logs.

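For a quick scan, run this command in your terminal (Nginx or Apache, combined log format) to list the most frequent user-agents: awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -nr | head -20. Splitting on double quotes makes field 6 the full user-agent string, whereas splitting on whitespace and printing $12 returns only its first word. Look for any of the strings in the table.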
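The same scan can be sketched in Python when you want counts per AI crawler rather than a raw top-20 list. This is a minimal sketch, assuming the combined log format; the token list is taken from this article's table, and the sample log lines are invented for illustration.

```python
import re
from collections import Counter

# Tokens named in this article. Google-Extended is left out because it is a
# robots.txt control token and does not appear as a request user-agent.
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "Claude-Web", "Claude-Search",
    "PerplexityBot", "PerplexityCrawler",
    "Google-Cloud-Solutions",
]

# In the combined log format the user-agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_ai_agents(lines):
    """Count requests per AI token found in each line's user-agent field."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        user_agent = match.group(1).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
    return counts


# Invented sample lines (documentation IP ranges), not real traffic.
sample_lines = [
    '203.0.113.5 - - [12/Jul/2026:10:00:00 +0000] "GET /p/1 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '203.0.113.6 - - [12/Jul/2026:10:00:01 +0000] "GET /p/2 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.6 - - [12/Jul/2026:10:00:02 +0000] "GET /p/3 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '198.51.100.9 - - [12/Jul/2026:10:00:03 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/127.0"',
]

if __name__ == "__main__":
    for token, hits in count_ai_agents(sample_lines).most_common():
        print(token, hits)
```

To run it against a real file, pass an open file handle instead of the sample list, for example count_ai_agents(open("access.log")).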

Known and silent AI crawler user-agents by provider (May 2026)

Provider	Documented User-Agent	Silent Variant (Observed)	Primary Purpose	robots.txt Rule (one group per token)
OpenAI	GPTBot	OAI-SearchBot, ChatGPT-User	Training / Search / Chat	User-agent: GPTBot, User-agent: OAI-SearchBot, User-agent: ChatGPT-User; each followed by Disallow: /
Anthropic	Claude-Web	Claude-Search	Search / Answer generation	User-agent: Claude-Web, User-agent: Claude-Search; each followed by Disallow: /
Google	Google-Extended	Google-Cloud-Solutions	AI training / Vertex AI pipelines	User-agent: Google-Extended, User-agent: Google-Cloud-Solutions; each followed by Disallow: /
Perplexity	PerplexityBot	PerplexityCrawler/1.0	Web indexing for answers	User-agent: PerplexityBot, User-agent: PerplexityCrawler; each followed by Disallow: /
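A robots.txt rule names the crawler on a User-agent line and puts the path on the Disallow line, so each token in the table needs its own group. The sketch below builds such a file and checks it with Python's standard urllib.robotparser. The function names and the example.com URL are illustrative assumptions, and the parser only shows how a compliant crawler would read the file, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_tokens):
    """Return robots.txt text with one 'Disallow: /' group per token."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in blocked_tokens]
    # An empty Disallow leaves every other crawler allowed.
    groups.append("User-agent: *\nDisallow:")
    return "\n\n".join(groups) + "\n"


def is_blocked(robots_txt, user_agent, url="https://example.com/products/1"):
    """True if a compliant crawler with this token may not fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots = build_robots_txt(["GPTBot", "OAI-SearchBot", "ChatGPT-User"])
    print(robots)
    for agent in ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "Googlebot"]:
        print(agent, "blocked" if is_blocked(robots, agent) else "allowed")
```

Checking the generated file this way catches the most common mistake, which is listing crawler names on the Disallow line where a path belongs.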

Setting Up a Crawler Honeypot to Catch the Sneaky Ones

Server logs show you what visited, but not whether a visitor read your robots.txt and ignored it. A compliant crawler simply never requests a disallowed URL; robots.txt itself blocks nothing at the server. To catch the crawlers that do not comply, you need a honeypot: a hidden page that only gets visited if a crawler ignores or misinterprets your robots.txt file.

The concept is simple: create a page that is explicitly disallowed in robots.txt and has no visible links. Then monitor its server log for hits. Any legitimate crawler should obey the directive. Hits mean either the crawler is ignoring rules or it's a threat actor. For AI crawlers, many legitimate ones obey, but some silent variants may not.

Here’s a proven setup based on a method published by Search Engine Land in April 2026. It takes under an hour and requires only basic WordPress or static site knowledge.
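The monitoring half of a honeypot can be sketched independently of any published method. The example below is a minimal sketch, assuming the combined log format; the honeypot path, the robots.txt rule, and the sample log lines are hypothetical values chosen for illustration.

```python
import re

# Hypothetical honeypot path: disallowed in robots.txt and linked from nowhere.
HONEYPOT_PATH = "/private-archive-7f3a/"
ROBOTS_RULE = f"User-agent: *\nDisallow: {HONEYPOT_PATH}"

# Captures client IP, request path, and user-agent from a combined-format line.
LINE_PATTERN = re.compile(
    r'^(?P<ip>\S+) .*?"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" \d{3} '
    r'.*"(?P<ua>[^"]*)"\s*$'
)


def honeypot_hits(lines, honeypot_path=HONEYPOT_PATH):
    """Return (ip, user_agent) for every request that touched the honeypot."""
    hits = []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match and match.group("path").startswith(honeypot_path):
            hits.append((match.group("ip"), match.group("ua")))
    return hits


# Invented sample lines (documentation IP ranges), not real traffic.
sample_lines = [
    '198.51.100.7 - - [12/Jul/2026:10:00:00 +0000] "GET /private-archive-7f3a/ HTTP/1.1" 200 512 "-" "ExampleCrawler/1.0"',
    '203.0.113.5 - - [12/Jul/2026:10:00:05 +0000] "GET /products/1 HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
]

if __name__ == "__main__":
    for ip, user_agent in honeypot_hits(sample_lines):
        print(ip, user_agent)
```

A hit does not prove an AI crawler was responsible; it only gives you an IP and a user-agent string to investigate further.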

Time to first unannounced crawler hit: 48-72 hours (reported by Search Engine Land honeypot field test, April 2026)

Silent crawler detection rate: a dominant portion (of those tested, the honeypot caught unknown user-agents within a short window)

The Trade-off: Should You Block or Allow? A Decision Framework

Once you know which AI crawlers are hitting your site, the next question is: should you block them? The answer depends on your content strategy and business goals. Blocking all AI crawlers reduces server load and protects proprietary content, but it also eliminates the chance of being cited in AI answers like ChatGPT, Perplexity, and Google AI Overviews.

Use the framework below to decide per crawler. The table compares the effects of blocking vs. allowing each major AI engine based on public case studies and documented behavior.

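For example, Google-Extended is a robots.txt control token rather than a separate crawler. According to Google’s documentation, it has no HTTP user-agent string of its own and does not affect a site’s inclusion or ranking in Google Search. Blocking it opts your content out of Gemini and Vertex AI use, but it should not be expected to reduce crawl volume, and visibility in AI Overviews is governed by the usual Search controls instead. Allowing PerplexityBot increases citation chances but may lead to higher bandwidth usage if the crawler is aggressive.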

If your goal is maximum AI citation, allow all well‑behaved crawlers and only block those causing performance issues
If you protect proprietary data, block all training‑focused crawlers (GPTBot, Google-Extended) but leave search‑focused ones (OAI-SearchBot, Claude-Search) if you want citations
For sites with high traffic, monitor bandwidth usage per user‑agent using server logs and block those consuming disproportionate resources without citation benefit
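The three rules above can be sketched as a small decision function. This is one possible reading of the framework, not a definitive policy: the token groupings follow the examples in the list, while the function name, the bandwidth_share input, and the 20% threshold are assumptions made for illustration.

```python
# Groupings taken from the examples in the list above.
TRAINING_TOKENS = {"GPTBot", "Google-Extended"}
SEARCH_TOKENS = {"OAI-SearchBot", "Claude-Search"}


def decide(token, wants_citations, protects_data,
           bandwidth_share=0.0, bandwidth_limit=0.2):
    """Return 'block' or 'allow' for one crawler token.

    bandwidth_share is the fraction of total bandwidth this crawler uses;
    bandwidth_limit (20% here) is an arbitrary illustrative threshold.
    """
    # Only search-focused crawlers are treated as bringing citation benefit.
    citation_benefit = wants_citations and token in SEARCH_TOKENS

    # Protecting data: block everything except crawlers that earn citations.
    if protects_data and not citation_benefit:
        return "block"
    # Heavy crawlers are blocked unless they bring citation benefit.
    if bandwidth_share > bandwidth_limit and not citation_benefit:
        return "block"
    return "allow"


if __name__ == "__main__":
    for token in sorted(TRAINING_TOKENS | SEARCH_TOKENS):
        print(token, decide(token, wants_citations=True, protects_data=True))
```

Keeping the policy in one function makes it easy to revisit the threshold once real per-crawler bandwidth figures are available from your logs.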

Before / After effect of blocking vs. allowing AI crawlers (case‑based)

Crawler	Before (Allowed)	After (Blocked)	Citation Impact	Server Load Change
GPTBot + OAI-SearchBot	ChatGPT citations appear in a broad portion of queries	Citations drop to near zero but can persist if other crawlers (for example ChatGPT-User) remain active	Loss of ChatGPT brand mention	Crawl volume down a meaningful portion
Claude-Web + Claude-Search	Claude answers occasionally cite your content	No Claude citations within a short window	Loss of Anthropic answer visibility	Negligible (low crawl rate)
Google-Extended	Content may be used for Gemini and Vertex AI	Opted out of Gemini and Vertex AI use; no effect on Search inclusion or ranking	Possible loss of Gemini visibility; AI Overviews follow Search controls	None expected (control token, no separate crawler)
PerplexityBot + PerplexityCrawler	Perplexity answers cite your pages regularly	Citations stop; replacement content appears	Loss of Perplexity source citations	Moderate reduction (a meaningful portion fewer requests)

Final Check: How to Verify Your Crawler Management Is Working

After you’ve set your robots.txt rules and honeypot, you need to confirm that the changes are effective. This rollout checklist ensures you don’t leave gaps. Follow it for two weeks after making changes.

The verification process borrows from the approach used by the HTTP Archive project: compare crawl frequency before and after the change for each user‑agent. Use free tools like CrawlerCheck or GoAccess to run reports without additional cost.
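The before-and-after comparison can be sketched in a few lines once you have request counts per user-agent token for each period. The function names and the sample counts below are invented for illustration; real counts would come from your own logs or a GoAccess report.

```python
def crawl_delta(before, after):
    """Per-token change in requests; negative means fewer after the change."""
    tokens = sorted(set(before) | set(after))
    return {t: after.get(t, 0) - before.get(t, 0) for t in tokens}


def new_agents(before, after):
    """Tokens seen after the change that were absent before it."""
    return sorted(t for t, n in after.items()
                  if n > 0 and before.get(t, 0) == 0)


# Invented weekly request counts, not measured data.
before_counts = {"GPTBot": 420, "OAI-SearchBot": 130, "PerplexityBot": 75}
after_counts = {"GPTBot": 0, "OAI-SearchBot": 128, "PerplexityCrawler": 12}

if __name__ == "__main__":
    for token, change in crawl_delta(before_counts, after_counts).items():
        print(f"{token}: {change:+d}")
    print("new:", new_agents(before_counts, after_counts))
```

A blocked token whose count does not fall, or a token that appears only after the change, is the signal worth following up in the honeypot logs.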

One operator lesson is that changes to robots.txt take up to 48 hours to be fully respected by crawlers that check the file periodically. Be patient and monitor the honeypot during this window.

Time for robots.txt to take full effect: 48 hours (based on Google Webmaster documentation and real-world observations)

Checklist

Day 1: Update robots.txt with the exact user‑agent strings from the table (both documented and silent variants)
Day 1: Deploy honeypot page and verify it's not linked anywhere
Day 3: Check server logs for any hits to the honeypot. If found, add that user‑agent to robots.txt if not already listed
Day 7: Re‑run the user‑agent extraction command from the crawler table section above. Compare the output to the list from before the change and note any new crawlers
Day 14: Review monthly traffic reports to see if AI citation mentions have changed (use ChatGPT, Perplexity, and Google AI Overviews checks)
Ongoing: Keep the honeypot active and review logs weekly for new unannounced crawlers

FAQ

How can I see which AI crawlers have visited my site in the last 30 days?

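Access your web server logs and extract the most frequent user‑agents using a command like `awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -nr | head -20` (this assumes the combined log format, where the user-agent is the sixth field when splitting on double quotes). Make sure the log file or its rotated archives cover the full 30 days. Compare the resulting user‑agent strings against the table in this article to identify AI crawlers. Free tools like GoAccess and CrawlerCheck also provide this analysis without command line knowledge.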

What is the difference between GPTBot and OAI-SearchBot? Which should I block?

GPTBot is used by OpenAI to collect content for training its language models, while OAI-SearchBot is used to surface and link to sites in ChatGPT's search features. If your goal is to prevent training on your content but still allow citation in ChatGPT answers, block GPTBot but allow OAI-SearchBot. If you want to block all OpenAI crawlers, block both, plus ChatGPT-User.

Does blocking an AI crawler in robots.txt actually stop it from indexing my content for training?

Reputable AI companies like OpenAI, Google, and Anthropic state they respect robots.txt for training purposes. However, silent crawler variants may not be covered by the same policies. Server log evidence shows that many unannounced crawlers ignore robots.txt entirely. A honeypot test is the best way to verify compliance.

How do I set up a honeypot to detect unannounced crawlers?

Create a page that is disallowed in robots.txt and has no internal links. Monitor its access logs. Any hit indicates a crawler that either ignores the directive or is using an undocumented user‑agent. The honeypot section of this article covers the setup in more detail.

What tools can I use for free to check AI crawler activity?

CrawlerCheck (free tier) simulates visits from multiple user‑agents and shows how your robots.txt is interpreted. GoAccess is an open‑source log analyzer that runs on your server. Cloudflare’s bot management console (free tier) also provides a list of detected crawlers. None require a paid subscription for basic use.

Should I block all AI crawlers or selectively allow some?

It depends on your business model. If you want AI citations for brand visibility, allow search‑oriented crawlers (OAI-SearchBot, Claude-Search) but block training‑focused ones (GPTBot, Google-Extended). If data protection is critical, block all AI crawlers but expect a drop in AI‑generated mentions.

Written by

EdenRank Team

AI Visibility researchers and practitioners. We build tools that help growth teams see where their brand appears in AI answers - and fix what's missing.
