How to Use Server Logs to Detect Unannounced AI Crawlers

Unannounced AI crawlers now represent a measurable share of bot traffic - and many teams miss them entirely.

Quick answer

To detect unannounced AI crawlers, extract user-agent strings from raw server logs using `awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn`, then diff the output against a whitelist of documented crawlers including GPTBot, ClaudeBot
Validate suspicious user-agent strings by running reverse DNS lookups on the requesting IP with `dig -x <IP>` and checking the result against each crawler operator's published IP ranges - a mismatch indicates a spoofed or unannounced crawler.
Block confirmed-unknown crawlers at the nginx or Apache level using user-agent map directives or IP CIDR deny rules, and verify the block with `curl -A 'BlockedAgentString' -I https://yourdomain.com` returning 403 while GPTBot and PerplexityBot still return

EdenRank Team · Published Jun 26, 2026 · 11 min read

TL;DR

Core problem: A growing share of AI crawlers use obfuscated or browser-mimicking user-agent strings that bypass standard bot detection.
Detection method: Pull user-agent strings from raw access logs, diff against a known-good whitelist, then validate suspicious entries via reverse DNS and IP reputation lookup.
Tools needed: GoAccess or a Python script for log parsing, ipinfo.io or Shodan for IP reputation, and your web server config for blocking rules.
Outcome: Whitelisting known AI crawlers (GPTBot, ClaudeBot, Google-Extended) while blocking unverified ones protects content attribution in AI answers.

11 min · Intermediate · Tools: GoAccess, Python, nginx, .htaccess, ipinfo.io

Who this is for

✅ Good fit

SEO operators who have access to raw Apache or nginx access logs
Growth leads whose brand appears in AI answers and want to control which crawlers index their content
Site owners who suspect unknown bots are scraping content without attribution

❌ Not for

Teams on fully managed hosting with no log access (Squarespace, Wix)
Engineers building crawler infrastructure - this is the defender's playbook, not the attacker's

Key takeaways

Run `awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn` weekly to extract and rank every user-agent string hitting your site.

Validate suspicious user-agents via reverse DNS with `dig -x <IP>` before writing any blocking rule - a spoofed GPTBot string is not the same as a real one.

Whitelist GPTBot, ClaudeBot, PerplexityBot, and Google-Extended explicitly in both robots.txt and your server config before adding any block rules.

Enforce blocks at the nginx or Apache level, not in robots.txt - non-compliant crawlers never fetch robots.txt.

Run a curl test with the blocked user-agent string within 60 seconds of deploying a new rule to confirm it returns 403.

Schedule a monthly four-step review: extract, diff, validate, update - the whole cycle takes under an hour once your baseline is set.

How to Understand Why Your Existing Defenses Miss Unannounced AI Crawlers

Unannounced AI crawlers evade detection because they do not identify themselves - they send requests with user-agent strings that mimic Chrome, Safari, or generic curl clients, so your WAF and robots.txt enforcement never trigger. The mechanism is straightforward: a crawler operator sets User-Agent: Mozilla/5.0 (compatible; ) and your server treats it as a browser session. By the time you notice unusual traffic patterns, the crawler may already have pulled thousands of pages. The fix starts in your access logs, not in [your robots.txt](/blog/robots-txt-lying-gptbot-blocks-fail-ai-crawlers).

Teams often check server logs reactively - after a traffic spike or a hosting bill anomaly. That cadence is too slow for AI crawler detection. A crawler completing a full site pull in 90 minutes leaves no spike in your analytics (it does not execute JavaScript), but it leaves a dense, time-compressed pattern in your raw access log: hundreds of sequential GET requests, zero CSS or image fetches, and, if your log format records request headers, a consistent Accept header that no real browser sends. These signals are readable in GoAccess or a basic Python script in under 20 minutes.
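The "many page requests, no asset fetches" signature can be checked mechanically. The following is a minimal sketch, assuming Combined Log Format; the function name `find_asset_free_clients`, the asset extension list, and the `min_pages` threshold are illustrative assumptions rather than values from any standard.

```python
import re
from collections import defaultdict

# Combined Log Format:
# ip - user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Assumed list of static-asset extensions a real browser would normally fetch.
ASSET_EXTENSIONS = (
    ".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".ico", ".woff2",
)


def find_asset_free_clients(lines, min_pages=100):
    """Return {ip: page_count} for clients with many page GETs and no asset fetches."""
    pages = defaultdict(int)
    assets = defaultdict(int)
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match["method"] != "GET":
            continue
        # Ignore the query string so "/style.css?v=2" still counts as an asset.
        path = match["path"].split("?", 1)[0].lower()
        if path.endswith(ASSET_EXTENSIONS):
            assets[match["ip"]] += 1
        else:
            pages[match["ip"]] += 1
    return {ip: n for ip, n in pages.items() if n >= min_pages and assets[ip] == 0}
```

Passing an open log file as `lines` works because the function only iterates over it. A hit is a lead to investigate, not proof: cached assets and API clients can produce the same pattern.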

The practical stakes are content attribution. When an unannounced crawler pulls your pages and feeds them into a training pipeline or a retrieval-augmented generation system, your content may surface in AI answers without your domain being cited. Known crawlers - `GPTBot`, `ClaudeBot`, `Google-Extended` - are documented, and their operators have stated citation policies. Unknown crawlers have no such policy on record. Identifying them is the first step to deciding whether to permit, restrict, or block them entirely.

One common objection is that server logs are too noisy to parse manually. That is true if you are reading raw text. It is false once you extract and sort by user-agent string. A site receiving 500,000 requests per day will typically have fewer than 200 distinct user-agent strings. Filtering to non-browser agents takes the list to under 40. From there, diffing against a known-good whitelist of documented crawlers leaves you with a manageable set of 5-15 unknowns per audit cycle - a 20-minute task, not a data engineering project.

In this article

1. Why unannounced crawlers bypass your existing defenses
2. How to extract and baseline user-agent strings from raw logs
3. How to validate suspicious entries with reverse DNS and IP reputation
4. How to classify crawlers by citation value before blocking
5. How to write and deploy blocking rules without hurting AI visibility
6. How to verify the block worked and monitor for new variants

How to Extract and Baseline User-Agent Strings from Raw Logs

Start with your raw access log - typically /var/log/nginx/access.log or /var/log/apache2/access.log. If your host abstracts logs behind a dashboard, request a raw export in Combined Log Format. The user-agent field is the last quoted string on each line. You do not need a commercial tool for this step: a single awk or grep command extracts every unique user-agent string and its request count in under 60 seconds on a log file of any reasonable size.

Run this command to get a ranked frequency list of user-agent strings: `awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn | head -100`. The output gives you the top 100 user-agents by request volume. Browser strings (`Mozilla/5.0`) will dominate the top of the list. Scroll past them to the bot-style strings - anything without a `Mozilla/5.0` prefix, anything containing terms like bot, crawler, spider, fetch, or scraper, and bare library strings like `python-requests/2.31`. These are your candidates.

Build a whitelist of documented AI crawlers and index bots. The canonical list for AI visibility purposes includes: `GPTBot` (OpenAI), `ClaudeBot` (Anthropic), `Google-Extended` (Google's robots.txt control token for AI training use; it is honored in robots.txt and does not appear as a separate user-agent string in logs), `Googlebot`, `Bingbot`, `Applebot`, `PerplexityBot`, `CCBot` (Common Crawl), and `Amazonbot`. Cross-reference this list against Google Search Central's crawler documentation and each AI provider's published user-agent pages. Any string in your log not matching a known entry in this whitelist is a candidate for further investigation.
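The extract-and-diff step can also be done in a few lines of Python. This is a minimal sketch, assuming Combined Log Format and case-insensitive substring matching against the tokens listed above; the function names `rank_user_agents` and `unknown_agents` are hypothetical.

```python
import re
from collections import Counter

# Tokens from the whitelist above; matching is a case-insensitive substring test.
KNOWN_CRAWLER_TOKENS = (
    "GPTBot", "ClaudeBot", "Google-Extended", "Googlebot", "Bingbot",
    "Applebot", "PerplexityBot", "CCBot", "Amazonbot",
)

# The user-agent is the last double-quoted field in Combined Log Format.
UA_RE = re.compile(r'"([^"]*)"\s*$')


def rank_user_agents(lines):
    """Count requests per user-agent string, like `sort | uniq -c | sort -rn`."""
    counts = Counter()
    for line in lines:
        match = UA_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def unknown_agents(counts, known=KNOWN_CRAWLER_TOKENS):
    """Return (user_agent, count) pairs that match no known token, most frequent first."""
    known_lower = [token.lower() for token in known]
    return [
        (ua, n) for ua, n in counts.most_common()
        if not any(token in ua.lower() for token in known_lower)
    ]
```

Ordinary browser strings will also land in the unknown list, so the result still needs the manual pass described above. A token match only means the string claims to be that crawler; it says nothing about whether the claim is true.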

GoAccess is the fastest interactive option for this step if you want a visual interface rather than command-line output. Install it with `apt install goaccess` or `brew install goaccess`, then run `goaccess access.log -o report.html --log-format=COMBINED` and open the HTML report. The Visitors by Browser panel shows user-agent breakdowns. GoAccess is free, open-source, and processes multi-gigabyte logs in seconds. For teams running log aggregation through a service like Datadog or Splunk, the equivalent query is a top aggregation on the http.useragent field filtered to non-browser strings.

Run `awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn` weekly to extract and rank every user-agent string hitting your site.

💡 Rotate your log baseline monthly

AI crawler operators update user-agent strings when they deploy new model versions. A string that was absent last month may appear this month. Run the extraction command on a fresh log export every 30 days, not just when you suspect a problem.

How to Validate Suspicious Crawlers with Reverse DNS and IP Reputation

A user-agent string is a self-reported value - any crawler can claim to be GPTBot. Validation requires checking the IP address the request came from, not the string the crawler sent. Extract the IP addresses associated with each suspicious user-agent string using: `grep 'SuspiciousAgentString' access.log | awk '{print $1}' | sort | uniq`. Then run a reverse DNS lookup on each IP with `dig -x <IP>` or `host <IP>`, and confirm that the returned hostname resolves forward to the same IP, because whoever controls an address block can point its PTR record at any name. Legitimate crawlers from major operators resolve to hostnames within their declared domains - for example, Googlebot resolves to *.googlebot.com, and GPTBot resolves to *.openai.com.
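The reverse DNS check is more reliable when paired with a forward lookup on the returned hostname (forward-confirmed reverse DNS). Below is a minimal sketch using only the standard library; the function name `verify_crawler_ip` and the injectable `reverse`/`forward` parameters are assumptions made so the logic can be exercised without network access.

```python
import socket


def _reverse(ip):
    """PTR lookup; returns the hostname or None when no record exists."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _forward(hostname):
    """Return the set of addresses the hostname resolves to (empty on failure)."""
    try:
        return {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except OSError:
        return set()


def verify_crawler_ip(ip, allowed_suffixes, reverse=_reverse, forward=_forward):
    """Forward-confirmed reverse DNS.

    The PTR hostname must end in one of allowed_suffixes, and that hostname
    must resolve back to the original IP.
    """
    hostname = reverse(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    return ip in forward(hostname)
```

A call such as `verify_crawler_ip(ip, (".googlebot.com",))` would check an address claiming to be Googlebot. Keep the leading dot in each suffix so that a hostname like `evilgooglebot.com` does not pass.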

For IP reputation checking beyond reverse DNS, use ipinfo.io's free API. Send a GET request to `https://ipinfo.io/<IP>/json` - the response includes org (the owning ASN), hostname, and country. An IP claiming to be a US-based AI crawler but resolving to an ASN registered in a jurisdiction with no known AI lab is a strong signal of obfuscation. Shodan's free tier also shows open ports and service banners for any IP, which can reveal whether the host is a datacenter, residential proxy, or known scraping infrastructure.
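The ipinfo.io lookup can be scripted so each suspicious IP yields the same fields. This is a minimal sketch, assuming the `org` field carries the AS number followed by the organization name (for example `AS64500 Example Hosting`); the function names are hypothetical, and `fetch_ipinfo` makes a live network request when called.

```python
import json
import urllib.request


def fetch_ipinfo(ip, timeout=10):
    """GET https://ipinfo.io/<IP>/json and return the decoded JSON object."""
    url = f"https://ipinfo.io/{ip}/json"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize_ipinfo(payload):
    """Split `org` into AS number and owner; keep hostname and country."""
    org = payload.get("org", "")
    asn, _, owner = org.partition(" ")
    if not asn.startswith("AS"):
        # Assumed fallback: treat the whole value as the owner name.
        asn, owner = "", org
    return {
        "asn": asn,
        "owner": owner,
        "hostname": payload.get("hostname", ""),
        "country": payload.get("country", ""),
    }
```

`summarize_ipinfo(fetch_ipinfo(ip))` gives one row per IP. Missing fields come back as empty strings, which makes an absent hostname easy to spot.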

Cross-reference suspicious IPs against published IP range documents. OpenAI publishes its GPTBot IP ranges at https://openai.com/gptbot-ranges.txt (confirm the current location in OpenAI's crawler documentation, since published paths can change). Google publishes Googlebot IP ranges through its Search Central documentation. Anthropic has published ClaudeBot documentation through its usage policy pages. If an IP claiming to be one of these crawlers does not appear in the published range, it is either a spoofed user-agent or an undocumented crawler variant - both warrant blocking or rate-limiting.
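Checking an IP against a published range list is a small job for Python's `ipaddress` module. The sketch below assumes the list has been saved as text with one CIDR block per line; the function names are hypothetical, and real operators may publish in other formats such as JSON.

```python
import ipaddress


def parse_ranges(text):
    """Parse one CIDR per line; blank lines and `#` comments are skipped."""
    networks = []
    for raw in text.splitlines():
        entry = raw.split("#", 1)[0].strip()
        if entry:
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def ip_in_ranges(ip, networks):
    """True when the address falls inside any of the published networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)
```

An address that claims a documented crawler's user-agent but returns `False` here belongs in the suspicious column of your audit sheet.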

Document your findings in a simple spreadsheet: IP address, reverse DNS result, ASN, claimed user-agent, request count in the log period, and your classification (known-good, unknown-benign, suspicious, confirmed-bad). This record becomes your audit trail and your input for the blocking decision in the next step. Teams that skip documentation end up re-investigating the same IPs every quarter. A shared Google Sheet with these columns takes 10 minutes to set up and saves hours of repeated work.

Reverse DNS validation outcomes for common crawler patterns

Claimed User-Agent	Reverse DNS Result	In Published IP Range	Classification
GPTBot/1.0	Resolves to *.openai.com	✅ Yes	✅ Known-good
ClaudeBot/1.0	Resolves to *.anthropic.com	✅ Yes	✅ Known-good
PerplexityBot/1.0	Resolves to *.perplexity.ai	✅ Yes	✅ Known-good
Mozilla/5.0 (compatible; MyAIBot)	No PTR record	❌ No	⚠️ Suspicious
python-requests/2.31.0	Resolves to datacenter ASN	❌ No	❌ Unknown scraper
GPTBot/1.0 (spoofed)	Resolves to non-OpenAI ASN	❌ No	❌ Spoofed - block

How to Classify Crawlers by Citation Value Before Blocking

Not every unknown crawler deserves a block. The classification question is: does permitting this crawler improve or harm my brand's presence in AI-generated answers? Known crawlers from AI engines that cite sources - GPTBot (ChatGPT), PerplexityBot (Perplexity), Google-Extended (Google AI Overviews), ClaudeBot (Claude) - should be whitelisted explicitly in your robots.txt and never blocked at the server level. Blocking them removes your content from the citation pool. This is the most common self-inflicted AI visibility wound in page audits.

The second category is known training crawlers with no direct citation path. CCBot (Common Crawl) feeds datasets that are widely used to train open-source models. Permitting it does not guarantee citation, but blocking it removes your content from a widely used training corpus. The decision here depends on whether you want your content in open training data. There is no universally correct answer - but the decision should be deliberate, not accidental. Document it in your crawler policy.

The third category is unverified crawlers: IPs with no reverse DNS, ASNs registered to unknown entities, and user-agent strings that do not match any published crawler documentation. These are the targets of your blocking effort. They consume bandwidth, contribute to content dilution in AI retrieval systems, and offer no citation upside. Blocking them does not reduce your AI visibility - it removes noise that can dilute your content's attribution signal in retrieval-augmented systems.

The fourth category requires a judgment call: emerging AI crawlers from startups that have published a user-agent string and a stated policy but are not yet widely cited by AI engines. Examples include crawlers from newer AI search tools that have not yet reached significant user bases. For these, a rate-limit rather than a hard block is the right posture - allow crawling at a capped rate (e.g., one request per five seconds via your nginx limit_req directive) while you monitor whether the crawler's associated product begins generating citations.
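The four categories can be written down as a small decision function so that every audit applies the same rules. This is a minimal sketch, assuming the groupings described above; the `Action` enum, the `classify` name, and the two boolean inputs are hypothetical simplifications of what a real crawler policy would record.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    DECIDE = "decide per crawler policy"
    RATE_LIMIT = "rate-limit"
    BLOCK = "block"


# Assumed groupings, taken from the categories described above.
CITATION_CRAWLERS = {"gptbot", "perplexitybot", "google-extended", "claudebot"}
TRAINING_CRAWLERS = {"ccbot"}


def classify(token, identity_verified, has_published_policy):
    """Map one crawler to an action.

    identity_verified: the requesting IP passed reverse DNS or range checks.
    has_published_policy: the operator documents a user-agent and a policy.
    """
    name = token.lower()
    if name in CITATION_CRAWLERS or name in TRAINING_CRAWLERS:
        if not identity_verified:
            # A known name from an unverified IP is treated as spoofed.
            return Action.BLOCK
        return Action.ALLOW if name in CITATION_CRAWLERS else Action.DECIDE
    if has_published_policy:
        # Emerging crawler: cap the rate and keep watching.
        return Action.RATE_LIMIT
    return Action.BLOCK
```

For the rate-limit case, the cap of one request per five seconds mentioned above corresponds to a rate of `12r/m` in an nginx `limit_req_zone` definition.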

Before and after classification

Before: Blocking all unknown user-agents indiscriminately, including GPTBot and PerplexityBot, removing content from AI citation pools entirely

After: Whitelist confirmed AI citation crawlers, rate-limit emerging unknowns, hard-block verified scrapers - citation coverage intact, noise reduced

“Blocking an unknown crawler that turns out to be PerplexityBot is the fastest way to disappear from AI answers. Validate before you block.”

— EdenRank operator observation

How to Write and Deploy Blocking Rules Without Hurting AI Visibility

Once you have classified your crawlers, implement your allow/block decisions at the web server level - not just in robots.txt. Robots.txt is advisory; a non-compliant crawler ignores it. Server-level rules enforce the decision regardless of crawler behavior. For nginx, use the map directive to match user-agent strings and return a 403 for blocked agents. For Apache, use mod_rewrite or BrowserMatch directives in .htaccess. Both approaches apply rules before your application stack processes the request, which means zero PHP or Node.js overhead for blocked requests.

For nginx, the blocking rule for a specific user-agent looks like this: in your nginx.conf or a site config file, add a `map $http_user_agent $blocked_agent` block listing the strings to block with value 1, then in your server block add `if ($blocked_agent) { return 403; }`. For IP-based blocking - appropriate when a crawler rotates user-agents but uses a consistent IP range - use `deny <IP_CIDR>;` inside your `location /` block. Combine both approaches for crawlers that spoof user-agents and use a consistent ASN.
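If the block list lives in a versioned file, the `map` block can be generated from it rather than edited by hand. The sketch below is one way to do that, assuming case-insensitive substring matching via nginx's `~*` regex prefix; `render_block_map` is a hypothetical helper, and it deliberately refuses tokens containing characters outside a conservative set.

```python
import re

# Conservative character set; anything else is rejected rather than escaped.
SAFE_TOKEN = re.compile(r"[A-Za-z0-9_./-]+")


def render_block_map(tokens, variable="$blocked_agent"):
    """Render an nginx `map` block flagging user-agents that contain any token."""
    lines = [f"map $http_user_agent {variable} {{", "    default 0;"]
    for token in sorted(set(tokens)):
        if not SAFE_TOKEN.fullmatch(token):
            raise ValueError(f"refusing to render unsafe token: {token!r}")
        pattern = token.replace(".", r"\.")  # escape the regex wildcard
        lines.append(f'    "~*{pattern}" 1;')
    lines.append("}")
    return "\n".join(lines)
```

The rendered block belongs in the `http` context, which is where nginx accepts `map`; the `if ($blocked_agent) { return 403; }` check stays in the server block as described above. Run `nginx -t` on the result before reloading.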

Your robots.txt should still reflect your intent for documented crawlers, even though it does not enforce anything for non-compliant bots. Add explicit `Allow: /` directives for GPTBot, ClaudeBot, Google-Extended, and PerplexityBot. This signals to compliant crawlers that you welcome their indexing, and it is the mechanism Google's Search Central documentation identifies for opting into AI training and retrieval. Do not rely on a catch-all `User-agent: *` group with `Disallow: /` for unknown agents - it has no effect on agents that do not fetch robots.txt.

After deploying rules, verify them with a curl test from a separate machine: `curl -A 'BlockedAgentString' -I https://yourdomain.com` should return `HTTP/1.1 403 Forbidden` (or `HTTP/2 403` if the server negotiates HTTP/2). Then check that your whitelisted crawlers are unaffected: `curl -A 'GPTBot/1.0' -I https://yourdomain.com` should return a 200 status. Run both tests within 60 seconds of deploying a rule change. A misconfigured rule that blocks GPTBot will not show up in your analytics - it will show up as a citation drop in AI answers two to four weeks later, by which point the cause is hard to trace.

⚠️ Never use a wildcard block on all unknown user-agents

A blanket `deny` for any user-agent not in your whitelist will block legitimate monitoring tools, uptime checkers, and feed readers — and it will block new AI citation crawlers before you have had a chance to evaluate them. Block specific confirmed-bad strings and IPs only.

How to Verify the Block Worked and Monitor for New Crawler Variants

After deploying blocking rules, run a fresh log extraction 48 hours later using the same awk command from Step 2. The blocked user-agent strings should appear in your log with 403 response codes rather than 200. If they still show 200, your rule did not apply - check that you reloaded nginx (`nginx -s reload`) or restarted Apache after editing the config. A common mistake is editing the wrong config file when a site uses multiple virtual host files; confirm which file is active with `nginx -T | grep server_name`.
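The user-agent extraction on its own does not show status codes, so confirming the 403s means pairing each user-agent with the status field. A minimal sketch, assuming Combined Log Format; `status_by_agent` and `still_served` are hypothetical names.

```python
import re
from collections import Counter, defaultdict

# Captures the status code and the trailing user-agent from a Combined Log line.
LOG_RE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$')


def status_by_agent(lines, tokens):
    """Count response codes for requests whose user-agent contains each token."""
    result = defaultdict(Counter)
    for line in lines:
        match = LOG_RE.search(line)
        if not match:
            continue
        ua = match["ua"].lower()
        for token in tokens:
            if token.lower() in ua:
                result[token][match["status"]] += 1
    return {token: dict(counts) for token, counts in result.items()}


def still_served(report):
    """Tokens that received any non-403 response and so are not fully blocked."""
    return sorted(
        token for token, counts in report.items()
        if any(code != "403" for code in counts)
    )
```

Run it only on log lines written after the rule went live; earlier entries will still show 200 and would be reported as not blocked.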

Set up a weekly automated check using a cron job that runs the user-agent extraction script and diffs the output against your whitelist. A simple Python script using the collections.Counter module can parse your access log, extract user-agent strings, and write any new unknowns to a text file. Schedule it with crontab -e to run every Sunday at 06:00. Pipe the output to an email or a Slack webhook so you do not have to remember to check manually. This is a one-time 30-minute setup that runs indefinitely.
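One way to structure the weekly script is to keep a file of every user-agent already reviewed and report only strings that are new since the last run. This is a sketch under that assumption (a "seen" file acting as the baseline); the function name and all three file paths are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# The user-agent is the last double-quoted field in Combined Log Format.
UA_RE = re.compile(r'"([^"]*)"\s*$')


def new_user_agents(log_path, seen_path, report_path):
    """Write user-agents not seen in earlier runs to report_path and return them."""
    seen_file = Path(seen_path)
    seen = set(seen_file.read_text().splitlines()) if seen_file.exists() else set()

    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = UA_RE.search(line)
            if match:
                counts[match.group(1)] += 1

    fresh = [(ua, n) for ua, n in counts.most_common() if ua not in seen]
    # One "count<TAB>user-agent" line per new string, most frequent first.
    Path(report_path).write_text("".join(f"{n}\t{ua}\n" for ua, n in fresh))
    # Update the baseline so the next run reports only newer strings.
    seen_file.write_text("\n".join(sorted(seen | set(counts))) + "\n")
    return fresh
```

A crontab entry starting with `0 6 * * 0` runs a script every Sunday at 06:00. Seeding the seen file with your whitelist before the first run keeps documented crawlers out of the first report.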

Monitor your AI citation presence alongside your log monitoring. If you notice a drop in citations in ChatGPT or Perplexity answers for queries where you previously appeared, cross-reference the timing with any recent blocking rule changes. A citation drop that coincides with a rule deployment is a signal that you may have blocked a legitimate AI crawler. Check your 403 logs for the period: `grep ' 403 ' access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn` will show you which user-agents are being blocked most frequently.

Build a simple monthly review cadence: extract user-agents, diff against your whitelist, validate new unknowns via reverse DNS and ipinfo.io, and update your block list and whitelist, documenting each change as you go. This four-step cycle takes under an hour per month once the initial setup is complete. Teams that run this cadence consistently maintain a clean signal for AI citation crawlers and catch new scraper variants within one review cycle of their first appearance - well before they can accumulate significant content pulls.

Checklist

Re-run `awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn | head -100` 48 hours after deploying rules
Confirm blocked user-agents now return `403` in the log, not `200`
Confirm `GPTBot`, `ClaudeBot`, `PerplexityBot`, and `Google-Extended` still return `200`
Set up a weekly cron job to extract new user-agent strings and diff against your whitelist
Monitor AI citation presence for the 2-4 weeks following any blocking rule change
Run a full monthly review: extract → diff → validate → update → document
Keep a versioned record of your block list and whitelist with dates of each change

Monthly time investment by task (minutes)

Task	Minutes
Log extraction + diff	20
Reverse DNS validation	15
Block rule update + test	15
Citation presence check	10

FAQ

How do I tell if a crawler is actually GPTBot or a spoofed user-agent?

Run a reverse DNS lookup on the requesting IP with `dig -x <IP>`. Legitimate GPTBot requests resolve to hostnames within `*.openai.com` and fall within OpenAI's published IP ranges at `openai.com/gptbot-ranges.txt`. If the IP resolves elsewhere, the user-agent is spoofed.

Will blocking unknown crawlers reduce my citations in ChatGPT or Perplexity?

Only if you accidentally block a known citation crawler like `GPTBot` or `PerplexityBot`. Blocking verified-unknown or scraper-class crawlers has no effect on citations from documented AI engines - those crawlers are not the source.

How often do AI crawler user-agent strings change?

Major operators like OpenAI and Anthropic update user-agent strings when they deploy significant crawler versions - roughly quarterly based on public changelog patterns. Run your log diff monthly to catch new variants within a reasonable window.

Can I use robots.txt alone to block unannounced AI crawlers?

No. Robots.txt is advisory and only works if the crawler fetches and respects it. Non-compliant crawlers ignore it entirely.

What is the fastest way to check IP reputation for a suspicious crawler IP?

Send a GET request to `https://ipinfo.io/<IP>/json` - the free tier returns the owning ASN, hostname, and country with no authentication required. Cross-reference the ASN against the crawler's claimed identity.

Does GoAccess work on compressed log files?

Yes. Pipe the decompressed output directly: `zcat access.log.gz | goaccess - -o report.html --log-format=COMBINED`. GoAccess handles stdin, so you can analyze archived logs without extracting them to disk.

Written by

EdenRank Team

AI Visibility researchers and practitioners. We build tools that help growth teams see where their brand appears in AI answers - and fix what's missing.
