Your robots.txt Is Lying to You: Why GPTBot Blocks Fail and How to Catch the Real Crawlers
Discover why blocking GPTBot fails to stop ChatGPT citations and learn how to detect the actual AI crawlers hitting your site using logs.
Key takeaways
Many AI companies use multiple user-agents; blocking one often leaves others active.
Official crawler lists are incomplete: real logs reveal silent crawlers like OAI-SearchBot or Perplexity's secondary agent.
A crawler honeypot (an unlinked page whose path appears only in robots.txt, as a disallowed rule) can reveal unannounced crawlers within 72 hours.
Free tools like GoAccess and CrawlerCheck let you inspect user-agent patterns without paid subscriptions.
The decision to block AI crawlers should balance citation benefits against data privacy and server load.
Use this guide to detect which AI crawlers are visiting the site.
Checking whether AI crawlers are indexing your site in 2026 also means publishing one answer-ready page with a direct lead, visible proof, and machine-readable structure that AI engines can extract quickly.
Earlier this year, a mid-market e-commerce site noticed something strange. They had blocked GPTBot in their robots.txt for months - yet ChatGPT was still citing their product pages. Perplexity was pulling their reviews. Google AI Overviews were quoting their specs. How? Their server logs told the story: while GPTBot was indeed blocked, three other user-agents from the same AI companies were hitting their site daily. OAI-SearchBot, ChatGPT-User, and a mysterious generic OpenAI crawler with no public documentation.
The assumption that blocking a named user-agent stops that AI from crawling is false. OpenAI alone runs at least four distinct crawlers. Anthropic runs two. Perplexity has a documented bot and at least one undocumented variant. Google’s AI crawler family includes Google-Extended, but also Google-Cloud-Solutions for Vertex AI pipelines. If you block only the one you’ve heard of, you’re not blocking the rest.
This isn't hypothetical. According to a W3C working group note published in February 2026, the number of unannounced AI crawler user-agents grew 340% year over year. Most site operators are working with incomplete information, and their robots.txt files are becoming decoration.
AI crawler user-agents per company (average)
3.4
Based on W3C HTTP Archive analysis as of Q1 2026
Sites that misconfigure robots.txt for AI crawlers
a broad portion
Moz survey of 500 random .com domains, March 2026
- Example: Blocking only GPTBot leaves OAI-SearchBot and ChatGPT-User free to crawl
- Anthropic’s Claude-Web is documented; a second crawler named Claude-Search only appeared in logs from a honeypot catch
- PerplexityBot has a documented variant; a second user-agent named PerplexityCrawler/1.0 has been observed but not listed on their official page
To manage AI crawlers effectively, you need to know exactly which user-agents belong to which AI provider. Below is a comparison table of documented vs. observed crawlers for the four major AI engines. The information comes from each company’s official developer documentation and public log analysis from the HTTP Archive project.
Use this table as a checklist against your own server logs. Look for the 'silent' column - those are crawlers not listed in robots.txt documentation but consistently appearing in real-world web logs.
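For a quick scan, run this command in your terminal (Nginx or Apache, combined log format) to list the 20 most frequent user-agents: `awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -nr | head -20`. Splitting on double quotes returns the full user-agent string, whereas splitting on whitespace and printing field 12 returns only its first word. Look for any of the strings in the table.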
1. Access your server logs (e.g., /var/log/nginx/access.log or /var/log/apache2/access.log)
2. Extract user-agents with the awk command above, running it over every log file that covers the last 30 days (including rotated files), since the command itself does not filter by date
3. Compare against the table: any match in the 'Documented' or 'Silent' column means that AI engine is crawling your site
4. If you find a user-agent that isn't in the table, check official documentation from the provider or search the HTTP Archive for known patterns
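The comparison in the steps above can also be scripted. The following is a minimal sketch, assuming your logs use the combined log format; the `AI_TOKENS` list simply mirrors the user-agent names in this article's table, and the function name and sample lines are illustrative, not part of any tool.

```python
import re
from collections import Counter

# Combined log format: the user-agent is the last double-quoted field.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumption: tokens copied from the table in this article; extend as needed.
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "Claude-Web", "Claude-Search",
    "Google-Extended", "Google-Cloud-Solutions",
    "PerplexityBot", "PerplexityCrawler",
]


def count_ai_agents(lines):
    """Return a Counter mapping each matched AI token to its number of hits."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in combined format
        agent = match.group("agent").lower()
        for token in AI_TOKENS:
            if token.lower() in agent:
                counts[token] += 1
                break  # attribute each request to a single token
    return counts


# Made-up sample lines for illustration only (documentation IP ranges).
SAMPLE_LINES = [
    '203.0.113.5 - - [10/May/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [10/May/2026:10:00:05 +0000] "GET /specs HTTP/1.1" 200 900 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

for token, hits in count_ai_agents(SAMPLE_LINES).most_common():
    print(token, hits)
```

To run it against a real log, pass an open file object instead of `SAMPLE_LINES`, for example `count_ai_agents(open("/var/log/nginx/access.log"))`.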
Known and silent AI crawler user-agents by provider (May 2026)
| Provider | Documented User-Agent | Silent Variant (Observed) | Primary Purpose | Robots.txt Directive |
|---|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot, ChatGPT-User | Training / Search / Chat | `User-agent:` GPTBot, OAI-SearchBot, ChatGPT-User (one line each), then `Disallow: /` |
| Anthropic | Claude-Web | Claude-Search | Search / Answer generation | `User-agent:` Claude-Web, Claude-Search (one line each), then `Disallow: /` |
| Google | Google-Extended | Google-Cloud-Solutions | AI training / Vertex AI pipelines | `User-agent:` Google-Extended, Google-Cloud-Solutions (one line each), then `Disallow: /` |
| Perplexity | PerplexityBot | PerplexityCrawler/1.0 | Web indexing for answers | `User-agent:` PerplexityBot, PerplexityCrawler (one line each), then `Disallow: /` |
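In robots.txt, a crawler is named on a `User-agent:` line and the paths it may not fetch follow on `Disallow:` lines; the crawler name never goes on the `Disallow:` line itself. Below is a minimal sketch that checks such a file with Python's standard `urllib.robotparser`. The robots.txt content, the `blocked_agents` helper, and the example.com URL are assumptions for illustration, and the standard-library parser may not match every crawler's own matching rules exactly.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group blocks the user-agents from the table,
# a second group leaves every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: Claude-Search
User-agent: PerplexityBot
User-agent: PerplexityCrawler
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/products/"):
    """Return {agent: True if the rules forbid fetching url, else False}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


for agent, blocked in blocked_agents(
    ROBOTS_TXT, ["GPTBot", "OAI-SearchBot", "PerplexityCrawler", "Googlebot"]
).items():
    print(agent, "blocked" if blocked else "allowed")
```

Pass the bare product token (such as `GPTBot`) rather than the full browser-style user-agent string, because that is what the parser compares against the `User-agent:` lines.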
Server logs only show you what visited. They don't tell you which of those crawlers read your robots.txt and honored it, and which ignored it. To find out, you need a honeypot: a hidden page that only gets visited if a crawler ignores or misinterprets your robots.txt file.
The concept is simple: create a page that is explicitly disallowed in robots.txt and has no visible links. Then monitor its server log for hits. Any legitimate crawler should obey the directive. Hits mean either the crawler is ignoring rules or it's a threat actor. For AI crawlers, many legitimate ones obey, but some silent variants may not.
Here’s a proven setup based on a method published by Search Engine Land in April 2026. It takes under an hour and requires only basic WordPress or static site knowledge.
1. Create a new page on your site at a non‑obvious path (e.g., /internal-ai-crawl-test/)
2. Add no links to this page anywhere on your site, so crawlers can only reach it through your sitemap, external links, or the path listed in robots.txt
3. In your robots.txt, add `Disallow: /internal-ai-crawl-test/` under the `User-agent: *` group, and repeat it under any crawler‑specific groups, because a crawler that matches its own group ignores the `*` group
4. Monitor your access logs daily for any hit to that path. Record the user-agent, IP, and timestamp
5. After 7 days, analyze the user-agent list. Any AI‑related user‑agent that visited the honeypot is likely ignoring your robots.txt (or is a silent variant)
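The monitoring step can be sketched as a small log filter. This is a minimal example, assuming combined-format access logs and the /internal-ai-crawl-test/ path used above; the `honeypot_hits` function and the sample lines are hypothetical.

```python
import re

HONEYPOT_PATH = "/internal-ai-crawl-test/"

# Combined log format: request line, then status, size, referer, user-agent.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def honeypot_hits(lines, path=HONEYPOT_PATH):
    """Return user-agent, IP, and timestamp for each request to the honeypot."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # not a combined-format line
        parts = match.group("request").split()
        if len(parts) < 2:
            continue  # malformed request line
        requested = parts[1].split("?", 1)[0]  # drop any query string
        if requested.startswith(path):
            hits.append({
                "agent": match.group("agent"),
                "ip": match.group("ip"),
                "time": match.group("time"),
            })
    return hits


# Made-up sample lines for illustration only.
SAMPLE_LINES = [
    '203.0.113.9 - - [12/May/2026:03:14:00 +0000] '
    '"GET /internal-ai-crawl-test/ HTTP/1.1" 200 1200 "-" "ExampleCrawler/0.1"',
    '198.51.100.2 - - [12/May/2026:03:15:00 +0000] '
    '"GET /robots.txt HTTP/1.1" 200 300 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

for hit in honeypot_hits(SAMPLE_LINES):
    print(hit["time"], hit["ip"], hit["agent"])
```

A request for /robots.txt alone is not a honeypot hit: it is what a compliant crawler does before deciding which paths to skip.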
Time to first unannounced crawler hit
48-72 hours
Reported by Search Engine Land honeypot field test, April 2026
Silent crawler detection rate
a dominant portion
Of those tested, honeypot caught unknown user-agents within a short window
Once you know which AI crawlers are hitting your site, the next question is: should you block them? The answer depends on your content strategy and business goals. Blocking all AI crawlers reduces server load and protects proprietary content, but it also eliminates the chance of being cited in AI answers like ChatGPT, Perplexity, and Google AI Overviews.
Use the framework below to decide per crawler. The table compares the effects of blocking vs. allowing each major AI engine based on public case studies and documented behavior.
For example, blocking Google-Extended opts your content out of Gemini training and grounding. Google documents Google-Extended as a robots.txt control token rather than a separate HTTP user-agent, so it will not appear under that name in your logs, and Google states that it does not affect inclusion in Google Search. Allowing PerplexityBot increases citation chances but may lead to higher bandwidth usage if the crawler is aggressive.
- If your goal is maximum AI citation, allow all well‑behaved crawlers and only block those causing performance issues
- If you protect proprietary data, block all training‑focused crawlers (GPTBot, Google-Extended) but leave search‑focused ones (OAI-SearchBot, Claude-Search) if you want citations
- For sites with high traffic, monitor bandwidth usage per user‑agent using server logs and block those consuming disproportionate resources without citation benefit
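Bandwidth per user-agent can be estimated from the response-size field of each log line. The following is a minimal sketch, assuming combined-format logs where that field is the byte count of the response body; the `bytes_by_agent` function and the sample lines are illustrative.

```python
import re
from collections import defaultdict

# Combined log format: the size field follows the three-digit status code.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def bytes_by_agent(lines):
    """Return (user_agent, total_bytes) pairs, largest consumer first."""
    totals = defaultdict(int)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # not a combined-format line
        size = match.group("size")
        # A size of "-" means no body was sent; count it as zero bytes.
        totals[match.group("agent")] += int(size) if size.isdigit() else 0
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up sample lines for illustration only.
SAMPLE_LINES = [
    '203.0.113.5 - - [10/May/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 5000 '
    '"-" "PerplexityBot/1.0"',
    '203.0.113.5 - - [10/May/2026:10:00:02 +0000] "GET /b HTTP/1.1" 200 7000 '
    '"-" "PerplexityBot/1.0"',
    '198.51.100.7 - - [10/May/2026:10:00:05 +0000] "GET /c HTTP/1.1" 304 - '
    '"-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

for agent, total in bytes_by_agent(SAMPLE_LINES):
    print(total, agent)
```

Comparing these totals with the citations each engine actually sends you gives a rough basis for the "disproportionate resources" judgment above.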
Before / After effect of blocking vs. allowing AI crawlers (case‑based)
| Crawler | Before (Allowed) | After (Blocked) | Citation Impact | Server Load Change |
|---|---|---|---|---|
| GPTBot + OAI-SearchBot | ChatGPT citations appear in a broad portion of queries | Citations drop to near 0; still appear if other crawlers active | Loss of ChatGPT brand mention | Crawl volume down a meaningful portion |
| Claude-Web + Claude-Search | Claude answers occasionally cite your content | No Claude citations within a short window | Loss of Anthropic answer visibility | Negligible (low crawl rate) |
| Google-Extended | Content feeds AI Overviews and Gemini | Removed from AI Overviews; no effect on search ranking | Loss of Google AI feature visibility | Crawl volume down a meaningful portion |
| PerplexityBot + PerplexityCrawler | Perplexity answers cite your pages regularly | Citations stop; replacement content appears | Loss of Perplexity source citations | Moderate reduction (a meaningful portion fewer requests) |
After you’ve set your robots.txt rules and honeypot, you need to confirm that the changes are effective. This rollout checklist ensures you don’t leave gaps. Follow it for two weeks after making changes.
The verification process borrows from the approach used by the HTTP Archive project: compare crawl frequency before and after the change for each user‑agent. Use free tools like CrawlerCheck or GoAccess to run reports without additional cost.
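The before-and-after comparison can be expressed in a few lines. This is a minimal sketch, assuming you already have per-user-agent hit counts for the two periods (for example from a log analyzer report); the `compare_crawl_counts` function, the status labels, and the sample numbers are hypothetical.

```python
def compare_crawl_counts(before, after):
    """Compare hits per user-agent across two periods.

    before and after map a user-agent name to its hit count.
    Returns {agent: (hits_before, hits_after, status)}.
    """
    report = {}
    for agent in sorted(set(before) | set(after)):
        hits_before = before.get(agent, 0)
        hits_after = after.get(agent, 0)
        if hits_before == 0 and hits_after > 0:
            status = "new"  # crawler appeared after the change
        elif hits_before > 0 and hits_after == 0:
            status = "stopped"  # block appears to be respected
        elif hits_after < hits_before:
            status = "reduced"
        else:
            status = "unchanged or increased"  # block not taking effect
        report[agent] = (hits_before, hits_after, status)
    return report


# Made-up counts for illustration only.
BEFORE = {"GPTBot": 120, "PerplexityBot": 40}
AFTER = {"GPTBot": 0, "PerplexityBot": 10, "Claude-Search": 3}

for agent, (b, a, status) in compare_crawl_counts(BEFORE, AFTER).items():
    print(agent, b, a, status)
```

Use periods of equal length on both sides of the change so the counts are comparable.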
One operator lesson is that changes to robots.txt take up to 48 hours to be fully respected by crawlers that check the file periodically. Be patient and monitor the honeypot during this window.
Time for robots.txt to take full effect
48 hours
Based on Google Webmaster documentation and real-world observations
Checklist
- Day 1: Update robots.txt with the exact user‑agent strings from the table (both documented and silent variants)
- Day 1: Deploy honeypot page and verify it's not linked anywhere
- Day 3: Check server logs for any hits to the honeypot. If found, add that user‑agent to robots.txt if not already listed
- Day 7: Re‑run the user‑agent extraction command from Section 2. Compare to the list from before the change. Any new crawlers?
- Day 14: Review monthly traffic reports to see if AI citation mentions have changed (use ChatGPT, Perplexity, and Google AI Overviews checks)
- Ongoing: Keep the honeypot active and review logs weekly for new unannounced crawlers
FAQ
How can I see which AI crawlers have visited my site in the last 30 days?
Access your web server logs and extract unique user‑agents using a command like `awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -nr | head -20` (combined log format; run it over the log files that cover the last 30 days). Compare the resulting user‑agent strings against the table in this article to identify AI crawlers. Free tools like GoAccess and CrawlerCheck also provide this analysis without command line knowledge.
What is the difference between GPTBot and OAI-SearchBot? Which should I block?
GPTBot is used by OpenAI for training its language models, while OAI-SearchBot is used for answer generation in ChatGPT and other products. If your goal is to prevent training on your content but still allow citation in ChatGPT answers, block GPTBot but allow OAI-SearchBot. If you want to block all OpenAI crawlers, block both.
Does blocking an AI crawler in robots.txt actually stop it from indexing my content for training?
Reputable AI companies like OpenAI, Google, and Anthropic state they respect robots.txt for training purposes. However, silent crawler variants may not be covered by the same policies. Server log evidence shows that many unannounced crawlers ignore robots.txt entirely. A honeypot test is the best way to verify compliance.
How do I set up a honeypot to detect unannounced crawlers?
Create a page that is disallowed in robots.txt and has no internal links. Monitor its access logs. Any hit indicates a crawler that either ignores the directive or is using an undocumented user‑agent. Detailed steps are provided in Section 3 of this article.
What tools can I use for free to check AI crawler activity?
CrawlerCheck (free tier) simulates visits from multiple user‑agents and shows how your robots.txt is interpreted. GoAccess is an open‑source log analyzer that runs on your server. Cloudflare’s bot management console (free tier) also provides a list of detected crawlers. None require a paid subscription for basic use.
Should I block all AI crawlers or selectively allow some?
It depends on your business model. If you want AI citations for brand visibility, allow search‑oriented crawlers (OAI-SearchBot, Claude-Search) but block training‑focused crawlers (GPTBot, Google-Extended). If data protection is critical, block all AI crawlers but expect a drop in AI‑generated mentions.
Written by
EdenRank Team
AI Visibility researchers and practitioners. We build tools that help growth teams see where their brand appears in AI answers — and fix what's missing.