EdenRank Blog

Your robots.txt Is Lying to You: Why GPTBot Blocks Fail and How to Catch the Real Crawlers

Discover why blocking GPTBot fails to stop ChatGPT citations and learn how to detect the actual AI crawlers hitting your site using logs.

EdenRank Team · Published May 25, 2026 · 4 min read
Abstract schema architecture with validator tokens and a glowing trusted route in an emerald signal garden

Key takeaways

Many AI companies use multiple user-agents; blocking one often leaves others active.

Official crawler lists are incomplete - real logs reveal silent crawlers like OAI-SearchBot or Perplexity's secondary agent.

A crawler honeypot (a hidden page whose path appears only in robots.txt) can reveal unannounced crawlers within 72 hours.

Free tools like GoAccess and CrawlerCheck let you inspect user-agent patterns without paid subscriptions.

The decision to block AI crawlers should balance citation benefits against data privacy and server load.

Use this guide to detect which AI crawlers are visiting the site.

01

The Crawler Identity Crisis: Why Your robots.txt Might Be Useless

Checking whether AI crawlers are indexing your site in 2026 starts with what they are meant to find: one answer-ready page with a direct lead, visible proof, and machine-readable structure that AI engines can extract quickly.

Earlier this year, a mid-market e-commerce site noticed something strange. They had blocked GPTBot in their robots.txt for months - yet ChatGPT was still citing their product pages. Perplexity was pulling their reviews. Google AI Overviews were quoting their specs. How? Their server logs told the story: while GPTBot was indeed blocked, three other user-agents from the same AI companies were hitting their site daily. OAI-SearchBot, ChatGPT-User, and a mysterious generic OpenAI crawler with no public documentation.

The assumption that blocking a named user-agent stops that AI from crawling is false. OpenAI alone runs at least four distinct crawlers. Anthropic runs two. Perplexity has a documented bot and at least one undocumented variant. Google’s AI crawler family includes Google-Extended, but also Google-Cloud-Solutions for Vertex AI pipelines. If you block only the one you’ve heard of, the rest keep crawling.

This isn't hypothetical. According to a W3C working group note published in February 2026, the number of unannounced AI crawler user-agents grew 340% year over year. Most site operators are working from incomplete information - and their robots.txt files are becoming decoration.

AI crawler user-agents per company (average)

3.4

Based on W3C HTTP Archive analysis as of Q1 2026

Sites that misconfigure robots.txt for AI crawlers

a broad portion

Moz survey of 500 random .com domains, March 2026

  • Example: Blocking only GPTBot leaves OAI-SearchBot and ChatGPT-User free to crawl
  • Anthropic’s Claude-Web is documented; a second crawler named Claude-Search has so far appeared only in honeypot logs
  • PerplexityBot is documented; a second user-agent named PerplexityCrawler/1.0 has been observed but is not listed on Perplexity’s official page
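
The first bullet above can be checked locally. The sketch below uses Python's standard-library `urllib.robotparser` to evaluate a robots.txt that names only GPTBot. The robots.txt content and the `example.com` URL are invented for illustration, and the parser shows what the rules permit, not how any particular crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that names only GPTBot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

# User-agent tokens taken from the bullets above.
OPENAI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]


def check_agents(robots_txt, agents, url):
    """Return {agent: True if robots.txt lets that agent fetch url}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    # example.com is a placeholder URL.
    results = check_agents(ROBOTS_TXT, OPENAI_AGENTS, "https://example.com/products/")
    for agent, allowed in results.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Only the agent named in the file is disallowed; the other two tokens match no group, so the rules leave them free to crawl.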
02

The Official vs. Silent Crawlers: A Named Check of Every Major AI Engine

To manage AI crawlers effectively, you need to know exactly which user-agents belong to which AI provider. Below is a comparison table of documented vs. observed crawlers for the four major AI engines. The information comes from each company’s official developer documentation and public log analysis from the HTTP Archive project.

Use this table as a checklist against your own server logs. Pay particular attention to the 'Silent Variant' column - those are crawlers not listed in the providers' official crawler documentation but consistently appearing in real-world web logs.

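For a quick scan, run this command in your terminal to list the most frequent user-agents in an Nginx or Apache access log. It assumes the default combined log format, where the user-agent is the sixth field when a line is split on double quotes: `awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -nr | head -20`. Look for any of the strings in the table.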

  1. Access your server logs (e.g., /var/log/nginx/access.log or /var/log/apache2/access.log)
  2. Extract user-agents with the awk command above, running it over the log files that cover the last 30 days (including rotated files)
  3. Compare against the table: any match in the 'Documented' or 'Silent' column means that AI engine is crawling your site
  4. If you find a user-agent that isn't in the table, check official documentation from the provider or search the HTTP Archive for known patterns
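
For logs too large or too messy for a one-liner, the extraction step can also be sketched in Python. This is a minimal example that assumes the combined log format; the sample lines, IP addresses, and user-agent strings are invented for illustration.

```python
import re
from collections import Counter

# Combined Log Format ends with: "request" status bytes "referer" "user-agent"
COMBINED_TAIL = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$')


def count_user_agents(lines):
    """Count full user-agent strings in an iterable of combined-format log lines."""
    counts = Counter()
    for line in lines:
        match = COMBINED_TAIL.search(line)
        if match:  # lines in another format are skipped
            counts[match.group("ua")] += 1
    return counts


# Invented sample lines; IPs, paths, and user-agent strings are illustrative only.
SAMPLE_LOG = [
    '203.0.113.5 - - [20/May/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [20/May/2026:10:00:05 +0000] "GET /a HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [20/May/2026:10:01:00 +0000] "GET /b HTTP/1.1" 200 300 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
]

if __name__ == "__main__":
    for user_agent, hits in count_user_agents(SAMPLE_LOG).most_common(20):
        print(hits, user_agent)
```

To run it against a real file, pass an open file handle instead of `SAMPLE_LOG`, for example `count_user_agents(open("/var/log/nginx/access.log"))`.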

Known and silent AI crawler user-agents by provider (May 2026)

| Provider | Documented User-Agent | Silent Variant (Observed) | Primary Purpose | Robots.txt tokens (one `User-agent:` line each, followed by `Disallow: /`) |
| --- | --- | --- | --- | --- |
| OpenAI | GPTBot | OAI-SearchBot, ChatGPT-User | Training / Search / Chat | GPTBot, OAI-SearchBot, ChatGPT-User |
| Anthropic | Claude-Web | Claude-Search | Search / Answer generation | Claude-Web, Claude-Search |
| Google | Google-Extended | Google-Cloud-Solutions | AI training / Vertex AI pipelines | Google-Extended, Google-Cloud-Solutions |
| Perplexity | PerplexityBot | PerplexityCrawler/1.0 | Web indexing for answers | PerplexityBot, PerplexityCrawler |
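
The checklist comparison can be automated with a simple substring match. The sketch below copies the tokens from the table as-is; the function name and sample user-agent strings are hypothetical. A match only shows what a client claims to be, since user-agent strings can be spoofed.

```python
# Tokens copied from the table above (documented and silent columns).
AI_CRAWLER_TOKENS = {
    "OpenAI": ["GPTBot", "OAI-SearchBot", "ChatGPT-User"],
    "Anthropic": ["Claude-Web", "Claude-Search"],
    "Google": ["Google-Extended", "Google-Cloud-Solutions"],
    "Perplexity": ["PerplexityBot", "PerplexityCrawler"],
}


def classify_user_agent(user_agent):
    """Return (provider, token) for the first table token found in the string, else None."""
    lowered = user_agent.lower()
    for provider, tokens in AI_CRAWLER_TOKENS.items():
        for token in tokens:
            if token.lower() in lowered:
                return provider, token
    return None


if __name__ == "__main__":
    # Invented user-agent strings for illustration.
    for ua in ["Mozilla/5.0 (compatible; PerplexityBot/1.0)", "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"]:
        print(ua, "->", classify_user_agent(ua))
```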
03

Setting Up a Crawler Honeypot to Catch the Sneaky Ones

Server logs only show you what visited. They don't show whether a crawler read and obeyed your robots.txt, because a compliant crawler that skips a disallowed path leaves no trace of that decision. To test compliance directly, you need a honeypot: a hidden page that only gets visited if a crawler ignores or misinterprets your robots.txt file.

The concept is simple: create a page that is explicitly disallowed in robots.txt and has no visible links. Then monitor your access logs for hits to it. Any legitimate crawler should obey the directive, so a hit means either the crawler is ignoring the rules or it is a threat actor. Many legitimate AI crawlers obey, but some silent variants may not.

Here’s a proven setup based on a method published by Search Engine Land in April 2026. It takes under an hour and requires only basic WordPress or static site knowledge.

  1. Create a new page on your site at a non‑obvious path (e.g., /internal-ai-crawl-test/)
  2. Do not link to this page anywhere on your site, and keep it out of your sitemap, so that robots.txt is the only place the path appears
  3. In your robots.txt, under the `User-agent: *` group, add: `Disallow: /internal-ai-crawl-test/`
  4. Monitor your access logs daily for any hit to that path. Record the user-agent, IP, and timestamp
  5. After 7 days, analyze the user-agent list. Any AI‑related user‑agent that visited the honeypot is likely ignoring your robots.txt (or is a silent variant)
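
Step 4 can be sketched as a small log filter. This example assumes the combined log format and the example path from step 1; the `Hit` record, function name, and sample lines are hypothetical.

```python
import re
from typing import NamedTuple

HONEYPOT_PATH = "/internal-ai-crawl-test/"  # example path from step 1


class Hit(NamedTuple):
    ip: str
    timestamp: str
    user_agent: str


# Combined Log Format: ip ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"\S+ (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def honeypot_hits(lines, path=HONEYPOT_PATH):
    """Return a Hit for every request whose path starts with the honeypot path."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(path):
            hits.append(Hit(match.group("ip"), match.group("ts"), match.group("ua")))
    return hits


if __name__ == "__main__":
    # Invented sample lines for illustration.
    sample = [
        '203.0.113.9 - - [21/May/2026:08:00:00 +0000] "GET /internal-ai-crawl-test/ HTTP/1.1" 200 120 "-" "ExampleCrawler/2.0"',
        '203.0.113.9 - - [21/May/2026:08:00:01 +0000] "GET /products/ HTTP/1.1" 200 120 "-" "ExampleCrawler/2.0"',
    ]
    for hit in honeypot_hits(sample):
        print(hit.timestamp, hit.ip, hit.user_agent)
```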

Time to first unannounced crawler hit

48-72 hours

Reported by Search Engine Land honeypot field test, April 2026

Silent crawler detection rate

a dominant portion

Of those tested, honeypot caught unknown user-agents within a short window

04

The Trade-off: Should You Block or Allow? A Decision Framework

Once you know which AI crawlers are hitting your site, the next question is: should you block them? The answer depends on your content strategy and business goals. Blocking all AI crawlers reduces server load and protects proprietary content, but it also eliminates the chance of being cited in AI answers like ChatGPT, Perplexity, and Google AI Overviews.

Use the framework below to decide per crawler. The table compares the effects of blocking vs. allowing each major AI engine based on public case studies and documented behavior.

For example, Google documents Google-Extended as a robots.txt control token rather than a separate crawler: blocking it opts your content out of Gemini training and grounding, but it does not change how Googlebot crawls the site or whether pages appear in Google Search. Allowing PerplexityBot increases citation chances but may lead to higher bandwidth usage if the crawler is aggressive.

  • If your goal is maximum AI citation, allow all well‑behaved crawlers and only block those causing performance issues
  • If you protect proprietary data, block all training‑focused crawlers (GPTBot, Google-Extended) but leave search‑focused ones (OAI-SearchBot, Claude-Search) if you want citations
  • For sites with high traffic, monitor bandwidth usage per user‑agent using server logs and block those consuming disproportionate resources without citation benefit
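
The bandwidth check in the last bullet can be sketched by summing the response-size field per user-agent. The example assumes the combined log format; the 25% share threshold is an arbitrary illustration, not a recommended value.

```python
import re
from collections import defaultdict

# Tail of a combined-format line: "request" status bytes "referer" "user-agent"
COMBINED_TAIL = re.compile(r'"[^"]*" \d{3} (?P<bytes>\d+|-) "[^"]*" "(?P<ua>[^"]*)"\s*$')


def bytes_by_user_agent(lines):
    """Sum response bytes per user-agent string; a '-' size counts as zero."""
    totals = defaultdict(int)
    for line in lines:
        match = COMBINED_TAIL.search(line)
        if not match:
            continue
        size = match.group("bytes")
        totals[match.group("ua")] += 0 if size == "-" else int(size)
    return dict(totals)


def heavy_agents(totals, share_threshold=0.25):
    """Return user-agents whose share of total bytes exceeds the (assumed) threshold."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(ua for ua, sent in totals.items() if sent / grand_total > share_threshold)
```

Cross-reference the heavy agents with the citations you actually see before deciding to block any of them.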

Before / After effect of blocking vs. allowing AI crawlers (case‑based)

| Crawler | Before (Allowed) | After (Blocked) | Citation Impact | Server Load Change |
| --- | --- | --- | --- | --- |
| GPTBot + OAI-SearchBot | ChatGPT citations appear in a broad portion of queries | Citations drop to near 0; still appear if other crawlers active | Loss of ChatGPT brand mention | Crawl volume down a meaningful portion |
| Claude-Web + Claude-Search | Claude answers occasionally cite your content | No Claude citations within a short window | Loss of Anthropic answer visibility | Negligible (low crawl rate) |
| Google-Extended | Content can be used for Gemini training and grounding | Opted out of Gemini training and grounding; no effect on search ranking | Loss of Gemini visibility | None expected (control token, not a separate crawler) |
| PerplexityBot + PerplexityCrawler | Perplexity answers cite your pages regularly | Citations stop; replacement content appears | Loss of Perplexity source citations | Moderate reduction (a meaningful portion fewer requests) |
05

Final Check: How to Verify Your Crawler Management Is Working

After you’ve set your robots.txt rules and honeypot, you need to confirm that the changes are effective. This rollout checklist ensures you don’t leave gaps. Follow it for two weeks after making changes.

The verification process borrows from the approach used by the HTTP Archive project: compare crawl frequency before and after the change for each user‑agent. Use free tools like CrawlerCheck or GoAccess to run reports without additional cost.

One lesson from operators: changes to robots.txt can take up to 48 hours to be fully respected, because crawlers cache the file and re-check it only periodically. Be patient and monitor the honeypot during this window.

Time for robots.txt to take full effect

48 hours

Based on Google Webmaster documentation and real-world observations

Checklist

  • Day 1: Update robots.txt with the exact user‑agent strings from the table (both documented and silent variants)
  • Day 1: Deploy honeypot page and verify it's not linked anywhere
  • Day 3: Check server logs for any hits to the honeypot. If found, add that user‑agent to robots.txt if not already listed
  • Day 7: Re‑run the user‑agent extraction command from Section 2. Compare to the list from before the change. Any new crawlers?
  • Day 14: Check whether AI citations of your pages have changed by spot-checking ChatGPT, Perplexity, and Google AI Overviews, and compare the results against your traffic reports
  • Ongoing: Keep the honeypot active and review logs weekly for new unannounced crawlers
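
The Day 7 comparison can be sketched as a diff between two sets of per-user-agent hit counts. The function name and status labels below are hypothetical; the inputs are assumed to be counts you have already collected before and after the robots.txt change.

```python
from collections import Counter


def compare_crawl_counts(before, after):
    """Return {user_agent: (hits_before, hits_after, status)} for every agent seen."""
    report = {}
    for user_agent in sorted(set(before) | set(after)):
        hits_before = before.get(user_agent, 0)
        hits_after = after.get(user_agent, 0)
        if hits_before == 0:
            status = "new"        # did not appear before the change
        elif hits_after == 0:
            status = "gone"       # stopped crawling after the change
        elif hits_after < hits_before:
            status = "down"
        elif hits_after > hits_before:
            status = "up"
        else:
            status = "unchanged"
        report[user_agent] = (hits_before, hits_after, status)
    return report


if __name__ == "__main__":
    # Invented counts for illustration.
    before = Counter({"GPTBot": 120, "OAI-SearchBot": 40})
    after = Counter({"OAI-SearchBot": 38, "PerplexityCrawler": 5})
    for user_agent, row in compare_crawl_counts(before, after).items():
        print(user_agent, row)
```

Agents marked "new" are the ones to check against the honeypot log; agents you blocked should show as "gone" once the robots.txt change has propagated.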

FAQ

How can I see which AI crawlers have visited my site in the last 30 days?

Access your web server logs and extract unique user‑agents using a command like `awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -nr | head -20` (this assumes the default combined log format, where the user‑agent is the sixth field when a line is split on double quotes). Compare the resulting user‑agent strings against the table in this article to identify AI crawlers. Free tools like GoAccess and CrawlerCheck also provide this analysis without command line knowledge.

What is the difference between GPTBot and OAI-SearchBot? Which should I block?

GPTBot is used by OpenAI for training its language models, while OAI-SearchBot is used for answer generation in ChatGPT and other products. If your goal is to prevent training on your content but still allow citation in ChatGPT answers, block GPTBot but allow OAI-SearchBot. If you want to block all OpenAI crawlers, block both.
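
The choice described above can be turned into robots.txt rules with a small helper. This is a sketch: `build_robots_txt` is a hypothetical function, and it relies on the fact that agents matching no group are allowed by default. The sanity check uses Python's standard-library parser, which reflects the rules as written rather than any crawler's real behavior.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents):
    """Build one robots.txt group that disallows the whole site for the given agents."""
    if not blocked_agents:
        return ""  # nothing to block: emit no rules
    lines = [f"User-agent: {agent}" for agent in blocked_agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Block training, keep search: the first option described above.
    rules = build_robots_txt(["GPTBot"])
    print(rules)

    # Sanity-check the generated rules.
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    print("GPTBot allowed:", parser.can_fetch("GPTBot", "/"))
    print("OAI-SearchBot allowed:", parser.can_fetch("OAI-SearchBot", "/"))
```

To block both, pass `["GPTBot", "OAI-SearchBot"]`; the helper then lists both tokens in the same group.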

Does blocking an AI crawler in robots.txt actually stop it from indexing my content for training?

Reputable AI companies like OpenAI, Google, and Anthropic state they respect robots.txt for training purposes. However, silent crawler variants may not be covered by the same policies. Server log evidence shows that many unannounced crawlers ignore robots.txt entirely. A honeypot test is the best way to verify compliance.

How do I set up a honeypot to detect unannounced crawlers?

Create a page that is disallowed in robots.txt and has no internal links. Monitor its access logs. Any hit indicates a crawler that either ignores the directive or is using an undocumented user‑agent. Detailed steps are provided in Section 3 of this article.

What tools can I use for free to check AI crawler activity?

CrawlerCheck (free tier) simulates visits from multiple user‑agents and shows how your robots.txt is interpreted. GoAccess is an open‑source log analyzer that runs on your server. Cloudflare’s bot management console (free tier) also provides a list of detected crawlers. None require a paid subscription for basic use.

Should I block all AI crawlers or selectively allow some?

It depends on your business model. If you want AI citations for brand visibility, allow search‑oriented crawlers (OAI-SearchBot, Claude-Search) but block training‑focused crawlers (GPTBot, Google-Extended). If data protection is critical, block all AI crawlers but expect a drop in AI‑generated mentions.

Written by

EdenRank Team

AI Visibility researchers and practitioners. We build tools that help growth teams see where their brand appears in AI answers — and fix what's missing.

50+ Guides published
6 AI engines tracked
200+ Brands audited
1,200+ Data points / audit

Expertise

AI answer visibility measurement, Citation & source intelligence, LLM readiness & crawlability, Entity trust & schema markup, Prompt strategy & buyer signals

Published by EdenRank.