AI Crawler Robots.txt Guide - GPTBot, Google-Extended, ClaudeBot Rules

Check how AI crawlers use robots.txt, see allow and block examples, and test whether Googlebot, GPTBot, ClaudeBot, PerplexityBot, and other crawlers can access key pages.

Direct answer

AI crawlers can be controlled with robots.txt user-agent rules. Some crawlers affect search indexing, while others affect AI training, AI answers, or page retrieval. Use crawler-specific rules to allow or block each bot intentionally.

What AI crawlers read in robots.txt

AI crawlers read the public robots.txt file at the root of a host, select the user-agent group that applies to them, and then match that group's allow rules, disallow rules, and wildcard paths against the exact URL being requested. Sitemap lines are discovery hints and do not change whether a path is allowed. A useful AI crawler robots.txt review checks the final canonical host, not only a staging domain or the homepage.
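
Because the file is read per host, the robots.txt location that governs a page can be derived from the page URL. The following is a minimal Python sketch of that idea; `robots_url_for` is a hypothetical helper name chosen for illustration, and the example hosts are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL on the same scheme, host, and port as page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (netloc); replace path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Placeholder URLs: each host resolves to its own robots.txt file.
for url in (
    "https://example.com/guides/ai-crawlers?ref=nav",
    "https://www.example.com/pricing",
    "https://staging.example.com/",
):
    print(url, "->", robots_url_for(url))
```

The three URLs map to three different files, which is why a result from a staging host says nothing about the canonical host.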

robots.txt is a policy file for compliant crawlers. It can say whether a crawler is allowed to request a path, but it is not authentication, not a noindex directive, and not proof that the crawler will use, rank, train on, or cite the page.

How Googlebot differs from Google-Extended

Googlebot is tied to Google Search crawling and indexing workflows. If Googlebot is blocked from important public URLs, Google Search may not be able to discover, crawl, or refresh those pages reliably.

Google-Extended is a separate Google control token for certain AI product uses. A Googlebot robots.txt audit should not treat Googlebot and Google-Extended as the same rule. Check both tokens when reviewing search crawling and AI policy.

How major AI crawlers identify themselves

Crawler identity usually appears as a user-agent token. OAI-SearchBot, GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, CCBot, Applebot-Extended, Amazonbot, Bytespider, Googlebot, and Google-Extended can all resolve to different policy decisions.

Do not assume one robots.txt rule covers every product from the same company. OpenAI search discovery, GPTBot model-training access, user-triggered browsing, Google Search crawling, and AI product controls should be reviewed as distinct crawler policy surfaces.

AI crawler user-agent tokens to review

Common AI-related tokens include OAI-SearchBot, GPTBot, ChatGPT-User, Google-Extended, ClaudeBot, PerplexityBot, CCBot, Applebot-Extended, Amazonbot, and Bytespider. Each token can represent a different product purpose, so avoid treating all AI crawlers as the same.

Googlebot and Google-Extended are separate. Blocking Google-Extended does not mean blocking Google Search crawling. If you want to change Google Search indexing, review Googlebot and page-level indexability separately.

OAI-SearchBot: OpenAI automatic search discovery.
GPTBot: OpenAI model-training crawler.
ChatGPT-User: user-triggered browsing and retrieval requests.
Google-Extended: control token for certain Gemini and Vertex AI uses.
PerplexityBot: answer engine crawling and retrieval.
CCBot: Common Crawl collection used by many downstream systems.
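
One way to act on this list is to check which tokens a robots.txt file names explicitly and which ones only fall through to a wildcard group. The sketch below is an illustration, not a full parser: `TOKEN_PURPOSES`, `named_tokens`, `audit`, and the `SAMPLE` file are hypothetical names and data, with purposes copied from the list above.

```python
# Purposes copied from the list above; keys are lowercased for matching.
TOKEN_PURPOSES = {
    "oai-searchbot": "OpenAI automatic search discovery",
    "gptbot": "OpenAI model-training crawler",
    "chatgpt-user": "user-triggered browsing and retrieval requests",
    "google-extended": "control token for certain Gemini and Vertex AI uses",
    "perplexitybot": "answer engine crawling and retrieval",
    "ccbot": "Common Crawl collection used by many downstream systems",
}


def named_tokens(robots_txt: str) -> set[str]:
    """Return the lowercased user-agent tokens that appear in robots_txt."""
    tokens = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            tokens.add(value.strip().lower())
    return tokens


def audit(robots_txt: str) -> dict[str, str]:
    """Label each known token as having an explicit group or not."""
    named = named_tokens(robots_txt)
    return {
        token: "explicit group" if token in named else "no explicit group"
        for token in TOKEN_PURPOSES
    }


# Hypothetical file: only GPTBot is named; everything else uses the wildcard.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

for token, status in audit(SAMPLE).items():
    print(f"{token:16} {status:18} {TOKEN_PURPOSES[token]}")
```

A token reported as "no explicit group" is not necessarily a problem, but it marks a policy that was inherited rather than chosen.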

Crawler reference directory

Some crawler tokens need conservative, source-specific handling because their downstream product impact can differ from normal search indexing. Review Amazonbot, Applebot-Extended, Bytespider, and CCBot as individual profiles instead of copying one generic AI crawler rule across all of them.

Use descriptive internal links and exact user-agent tests so a crawler can discover the reference page and a site owner can confirm whether the deployed robots.txt file matches the intended policy.

How to allow AI crawler access

A simple allow policy makes AI crawler access explicit. Test the rule at the final canonical host because staging rules, redirects, and path-specific disallow rules can change the result.

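Use explicit user-agent groups when the business policy is specific. A wildcard allow can be fine, but it is harder to audit when different AI crawlers have different purposes. The example below allows OAI-SearchBot and PerplexityBot by name, blocks GPTBot, and allows every other crawler through the wildcard group.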

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
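
The policy above can be evaluated locally with Python's standard-library `urllib.robotparser`. This is a rough sketch rather than a model of any specific crawler: the standard-library parser applies the first matching rule in file order and matches user-agent names by substring, which can differ from the longest-match behavior that some crawlers document. For groups that contain only `Allow: /` or `Disallow: /`, the two approaches agree. `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# The example policy from this section, held as a string for local testing.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(token: str, url: str) -> bool:
    """Return True if the parsed policy lets `token` request `url`."""
    return parser.can_fetch(token, url)


# ClaudeBot has no group of its own here, so it falls back to User-agent: *.
for token in ("OAI-SearchBot", "GPTBot", "PerplexityBot", "ClaudeBot"):
    print(token, is_allowed(token, "https://example.com/guides/"))
```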

How to block AI crawler access

A specific block policy restricts one AI crawler without necessarily changing search indexing or every other crawler. Blocking is a policy decision, not an authentication layer.

If the intent is to restrict a training or AI product control token but keep Google Search crawling, avoid blocking Googlebot by accident. If the intent is to restrict Search crawling, review noindex, canonical, sitemap, and Googlebot rules together.

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Googlebot
Allow: /

How to test whether a crawler is allowed

Start with a robots.txt checker on the homepage, then test important sections such as documentation, pricing, support, guide pages, and article URLs. A root allow can coexist with deeper disallow rules that block useful pages.

For a reliable AI crawler test, check the final URL after redirects, the tested path, the matched user-agent group, the matched directive, and whether the status is allowed, blocked, or unspecified. Unspecified is not automatically bad, but explicit policy is easier to audit.

Common robots.txt mistakes

Common mistakes include leaving development blocks in production, testing only User-agent: *, blocking documentation while allowing the homepage, and describing a Google-Extended rule as if it controlled normal Google Search indexing.

Another mistake is listing important pages in llms.txt or sitemap.xml while robots.txt blocks the same URLs. Keep robots.txt, sitemap.xml, canonical tags, and llms.txt aligned so crawlers receive consistent signals.
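
A consistency check of this kind can be sketched in Python by reading the URLs from a sitemap and testing each one against robots.txt. Everything below is illustrative: `ROBOTS_TXT` and `SITEMAP_XML` are made-up inputs, `sitemap_urls` and `blocked_listed_urls` are hypothetical helper names, and the standard-library parser uses first-match rule order, so treat the output as a first pass. The same loop could be fed URLs taken from llms.txt.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the documentation section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /docs/
"""

# Hypothetical sitemap that still lists a documentation URL.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/setup</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list[str]:
    """Extract the <loc> values from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]


def blocked_listed_urls(robots_txt: str, urls: list[str], token: str) -> list[str]:
    """Return the listed URLs that robots.txt blocks for the given token."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(token, url)]


conflicts = blocked_listed_urls(ROBOTS_TXT, sitemap_urls(SITEMAP_XML), "GPTBot")
print(conflicts)
```

Any URL in the result is being advertised in one file and refused in another, which is the inconsistency described above.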

How to run an AI crawler test

An AI crawler test should answer one narrow question: for this exact URL path, does the deployed robots.txt policy allow, block, or leave the crawler unspecified? The test should use the final canonical hostname, because robots.txt applies per host: the file served on example.com does not govern www.example.com, a staging host, or the target of a redirect unless that host serves the same rules.

Run the test for at least three paths: the homepage, one high-value guide or documentation page, and one commercial page such as pricing or product. This catches cases where User-agent: * allows the root path but a deeper Disallow rule blocks the content that AI search systems would actually need to read.

Fetch /robots.txt from the canonical host.
Choose a crawler token such as Googlebot, Google-Extended, GPTBot, ClaudeBot, or PerplexityBot.
Test the exact page path that should be discoverable.
Record the matched rule and whether the result is allowed, blocked, or unspecified.
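
The last two steps can be sketched as a small matcher that records the matched group, the matched directive, and the status. This is an illustrative sketch, not any crawler's actual implementation. It assumes the matching rules described in RFC 9309 (most specific path wins, Allow wins a tie, `*` wildcard, trailing `$` anchor), exact case-insensitive token matching with fallback to `*`, and it treats "unspecified" as "no group or no rule matched", which is an assumption made here. All names (`Result`, `parse_groups`, `pattern_matches`, `check`) and the sample file are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    token: str
    path: str
    group: Optional[str]      # matched user-agent group, or None
    directive: Optional[str]  # for example "Disallow: /docs/", or None
    status: str               # "allowed", "blocked", or "unspecified"


def parse_groups(robots_txt):
    """Return {lowercased user-agent: [(kind, pattern), ...]}."""
    groups, current, in_rules = {}, [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if not sep:
            continue
        if field == "user-agent":
            if in_rules:  # a rule was already seen, so a new group starts
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow") and current:
            in_rules = True
            if value:  # an empty pattern matches nothing
                for agent in current:
                    groups[agent].append((field, value))
    return groups


def pattern_matches(pattern, path):
    """Prefix match where '*' is a wildcard and a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


def check(robots_txt, token, path):
    """Evaluate one token and one path, returning the matched group and rule."""
    groups = parse_groups(robots_txt)
    key = token.lower()
    group = key if key in groups else ("*" if "*" in groups else None)
    if group is None:
        return Result(token, path, None, None, "unspecified")
    matches = [(k, p) for k, p in groups[group] if pattern_matches(p, path)]
    if not matches:
        return Result(token, path, group, None, "unspecified")
    # Longest pattern wins; on equal length, Allow wins.
    kind, pattern = max(matches, key=lambda m: (len(m[1]), m[0] == "allow"))
    status = "allowed" if kind == "allow" else "blocked"
    return Result(token, path, group, f"{kind.capitalize()}: {pattern}", status)


# Hypothetical policy: the root is allowed but /docs/ is blocked for most bots.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
Disallow: /docs/
"""

for token in ("Googlebot", "GPTBot", "ClaudeBot"):
    for path in ("/", "/docs/setup", "/pricing"):
        print(check(ROBOTS_TXT, token, path))
```

Running the three paths for each token surfaces the case described above, where the root is allowed but a deeper Disallow rule blocks the documentation.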

Googlebot robots.txt vs Google-Extended robots.txt

A Googlebot robots.txt review is different from a Google-Extended review. Googlebot is the token that matters when you are asking whether Google Search can crawl and refresh public pages. Google-Extended is a separate token for certain Google AI product controls.

This distinction matters during migrations and policy reviews. Blocking Google-Extended only is not the same as blocking Google Search. Blocking Googlebot, however, can reduce Google Search crawling and indexing for the blocked paths.

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /

How to check if Google robots are allowed

A Google robots.txt check should inspect both the user-agent token and the path. A homepage may be allowed while /guides/, /docs/, /pricing/, or /support/ is blocked. The answer should include the tested URL, the Google crawler token, the matched directive, and the effective result.

When checking Google crawler access, do not stop at robots.txt. Confirm that the page returns 200, has a self-referencing canonical, is not noindexed, and appears in the sitemap when it is intended for indexing. robots.txt answers crawl permission; it does not prove indexability.
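
The page-level checks listed above can be sketched with Python's standard-library HTML parser. This is a simplified illustration that works on an HTML string already in hand: `IndexabilityParser`, `indexability_report`, and the sample markup are hypothetical. It only inspects `<meta>` and `<link>` tags and does not cover a noindex sent in an HTTP response header.

```python
from html.parser import HTMLParser


class IndexabilityParser(HTMLParser):
    """Collect the canonical URL and any noindex meta directive."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = {name: (value or "") for name, value in attrs}
        if tag == "meta" and a.get("name", "").lower() in ("robots", "googlebot"):
            if "noindex" in a.get("content", "").lower():
                self.noindex = True
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.canonical = a.get("href")


def indexability_report(url, status_code, html, sitemap_urls):
    """Return one boolean per check; all True means no blocker was found."""
    parser = IndexabilityParser()
    parser.feed(html)
    return {
        "status_200": status_code == 200,
        "self_canonical": parser.canonical == url,
        "not_noindexed": not parser.noindex,
        "in_sitemap": url in sitemap_urls,
    }


# Hypothetical page markup and sitemap contents.
PAGE_URL = "https://example.com/guides/ai-crawlers"
HTML = """<html><head>
<link rel="canonical" href="https://example.com/guides/ai-crawlers">
<meta name="robots" content="index, follow">
</head><body></body></html>"""

print(indexability_report(PAGE_URL, 200, HTML, {PAGE_URL}))
```

A page can pass the robots.txt check and still fail one of these, which is why the two reviews are kept separate.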

Related robots and AI discovery checks

Use this guide as the hub for AI crawler policy. Then validate the crawler-specific page that matches your question: Googlebot and Google-Extended for Google crawler policy, GPTBot for OpenAI training controls, and llms.txt pages for the emerging AI discovery convention.

A strong crawler policy does not guarantee AI citation or LLM visibility. It only removes one technical barrier before content quality, schema extractability, source clarity, and Search Console behavior are reviewed.

When to use the AI Crawler Robots.txt Checker

Use the AI Crawler Robots.txt Checker before a launch, after a migration, after adding a new robots.txt template, and before requesting indexing for important pages in Search Console. It is also useful when rankings drop after a deploy and you need to confirm whether an accidental block changed crawler access.

Use it as a test-and-fix workflow: test a path, read the matched directive, compare the result with business intent, then fix the most specific rule first. After the rule changes, retest the same paths so the before-and-after result is easy to document.
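
The compare-and-retest step can be sketched as a table of intended outcomes checked against two versions of the file. This is an illustration under assumptions: `INTENT`, `BEFORE`, `AFTER`, `results`, and `mismatches` are hypothetical names and data, and the standard-library `urllib.robotparser` is used for the evaluation, so its first-match rule order applies.

```python
from urllib.robotparser import RobotFileParser


def results(robots_txt, checks):
    """Evaluate each (token, path) pair against one robots.txt version."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(token, path): parser.can_fetch(token, path) for token, path in checks}


def mismatches(robots_txt, intent):
    """Return the checks whose actual result differs from the intended one."""
    actual = results(robots_txt, intent.keys())
    return {key: actual[key] for key in intent if actual[key] != intent[key]}


# Hypothetical business intent: True means the crawler should be allowed.
INTENT = {
    ("Googlebot", "/docs/setup"): True,
    ("GPTBot", "/docs/setup"): False,
}

# Hypothetical file before the fix: a leftover block on /docs/ for everyone.
BEFORE = """\
User-agent: *
Disallow: /docs/
"""

# Hypothetical file after the fix: only GPTBot is blocked.
AFTER = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print("before:", mismatches(BEFORE, INTENT))
print("after: ", mismatches(AFTER, INTENT))
```

Keeping the same intent table for both runs gives a before-and-after record that can be attached to the change.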
