What robots.txt actually does
robots.txt is a plain-text file at the root of your host (e.g. https://example.com/robots.txt) that tells crawlers which paths they may or may not fetch. It is a request, not a security fence. Anything sensitive belongs behind authentication, not a Disallow rule.
A valid robots.txt is organized into groups. Each group starts with one or more User-agent: lines, followed by the Allow: and Disallow: rules that apply to those agents.
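To make the group structure concrete, here is a minimal Python sketch that splits a robots.txt string into groups. The Group class and parse_groups function are hypothetical names invented for this illustration, and the parser deliberately ignores Sitemap, Crawl-delay and any other directive. It is not how any particular crawler parses the file.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Group:
    agents: List[str] = field(default_factory=list)
    rules: List[Tuple[str, str]] = field(default_factory=list)  # (directive, path)


def parse_groups(text: str) -> List[Group]:
    """Split robots.txt text into groups of User-agent lines plus their rules."""
    groups: List[Group] = []
    current: Optional[Group] = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()  # directive names are case-insensitive
        if key == "user-agent":
            # A User-agent line that follows rules starts a new group;
            # consecutive User-agent lines share one group.
            if current is None or current.rules:
                current = Group()
                groups.append(current)
            current.agents.append(value)
        elif key in ("allow", "disallow") and current is not None:
            current.rules.append((key, value))
    return groups


sample = "User-agent: a\nUser-agent: b\nDisallow: /x\nUser-agent: *\nAllow: /\n"
for group in parse_groups(sample):
    print(group.agents, group.rules)
```

The sample yields two groups: one shared by agents a and b, and one for the wildcard agent.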
Everything happens in your browser. Presets, rules and previews never touch a server.
Anatomy of a robots.txt
# Comment
User-agent: *
Disallow: /admin
Allow: /admin/public
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
- User-agent: * targets every bot.
- Disallow: /admin blocks every path that starts with /admin.
- Allow: /admin/public reopens a sub-path (in Google's implementation the longest matching rule wins, so this Allow beats the shorter Disallow).
- Crawl-delay: 10 is honored by some crawlers such as Bing and ignored by Google.
- Sitemap: lines are absolute URLs and can appear anywhere in the file.
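The precedence between Allow and Disallow can be sketched in a few lines of Python. This is an illustration of longest-match semantics for plain prefix rules only (no * or $ wildcards); is_allowed is a hypothetical helper, not part of any library, and other crawlers may resolve conflicts differently.

```python
from typing import List, Tuple


def is_allowed(path: str, rules: List[Tuple[str, str]]) -> bool:
    """Longest matching prefix wins; ties go to Allow; no match means allowed."""
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty rules and non-matching prefixes are skipped
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/admin"), ("Allow", "/admin/public")]
print(is_allowed("/admin/settings", rules))     # False
print(is_allowed("/admin/public/help", rules))  # True
print(is_allowed("/blog", rules))               # True
```

Because rules are prefixes, /administrator would also be blocked by Disallow: /admin in this sketch.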
The five presets: when to use each
| Preset | When to use |
|---|---|
| Allow everything | Default for public sites. |
| Block everything | Staging and preview environments only. |
| WordPress defaults | Hides /wp-admin and /wp-includes while keeping AJAX open. |
| Shopify defaults | Blocks /admin, /cart, /checkout and duplicate-content tag URLs. |
| Block common AI bots | GPTBot, ClaudeBot, PerplexityBot, CCBot, Google-Extended. |
Blocking AI crawlers in 2025
The major AI companies publish user-agent tokens for their crawlers and state that they respect robots.txt, but compliance is voluntary. Use a dedicated group for each agent:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: *
Allow: /
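Google-Extended is not a separate crawler but a control token: it tells Google whether content fetched by its existing crawlers may be used for AI training. Blocking it does not affect Search, because indexing is governed by a separate user agent (Googlebot).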
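One way to sanity-check a file like this before deploying it is Python's standard-library urllib.robotparser. The sketch below uses a shortened copy of the rules; the URL is a placeholder, and the stdlib parser is only an approximation of how real crawlers evaluate the file (for instance, it does not implement wildcard matching).

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The AI agents should be refused; any other agent falls through to the * group.
for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
    print(agent, parser.can_fetch(agent, "https://example.com/article"))
```

With these rules the loop reports False for GPTBot and ClaudeBot and True for Googlebot.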
Five mistakes we see the most
- Disallowing your own assets. Blocking /wp-content/ or /static/ prevents Google from rendering your pages properly.
- A single trailing /*. Disallow: /* accidentally blocks the entire site.
- Using robots.txt for privacy. Search engines respect it; scrapers do not. Anything sensitive belongs behind auth.
- Forgetting the sitemap line. Always list your sitemap here so crawlers can discover it even when nothing links to it.
- Multiple contradictory groups. Merge duplicate User-agent blocks into one: a crawler follows only the most specific group that matches it, and not every parser combines repeated groups for the same agent.
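Several of these mistakes can be caught mechanically. The following Python sketch is a rough heuristic linter, not an exhaustive validator; lint_robots and its warning strings are made up for this example, and the asset prefixes are just the two paths mentioned in the list.

```python
from typing import List, Set


def lint_robots(text: str) -> List[str]:
    """Return warnings for a few common robots.txt mistakes (heuristic only)."""
    warnings: List[str] = []
    seen_agents: Set[str] = set()
    current_agents: List[str] = []
    in_rules = False
    has_sitemap = False
    asset_prefixes = ("/wp-content/", "/static/")  # example asset paths

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:  # a User-agent line after rules opens a new group
                current_agents, in_rules = [], False
            agent = value.lower()
            if agent in seen_agents:
                warnings.append(f"duplicate group for user-agent {value!r}")
            seen_agents.add(agent)
            current_agents.append(agent)
        elif key == "sitemap":
            has_sitemap = True
        elif key in ("allow", "disallow"):
            in_rules = True
            if key == "disallow":
                if value in ("/", "/*") and "*" in current_agents:
                    warnings.append(f"'Disallow: {value}' blocks the whole site for all bots")
                if value.startswith(asset_prefixes):
                    warnings.append(f"'Disallow: {value}' may block assets needed for rendering")
    if not has_sitemap:
        warnings.append("no Sitemap line found")
    return warnings


bad = "User-agent: *\nDisallow: /*\nDisallow: /static/\nUser-agent: *\nDisallow: /tmp\n"
for message in lint_robots(bad):
    print(message)
```

A whole-site block is legitimate on staging hosts, so the output is a list of things to review rather than errors.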
Common patterns
Allow only the homepage
User-agent: *
Allow: /$
Disallow: /
Block a single query parameter
User-agent: *
Disallow: /*?utm_source=
Disallow: /*&utm_source=
Different rules for image bots
User-agent: Googlebot-Image
Allow: /images/
Disallow: /
User-agent: *
Allow: /
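The * and $ characters used in these patterns can be illustrated by translating a rule into a regular expression. The sketch below assumes the wildcard semantics documented by Google (* matches any run of characters, a trailing $ anchors the end of the URL); pattern_to_regex and matches are hypothetical helpers written for this example.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape literal pieces and join them with '.*' where the pattern had '*'.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path, like a prefix rule.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/$", "/"))        # True: only the bare homepage
print(matches("/$", "/about"))   # False
print(matches("/*?utm_source=", "/page?utm_source=news"))       # True
print(matches("/*?utm_source=", "/page?id=1&utm_source=news"))  # False
print(matches("/*&utm_source=", "/page?id=1&utm_source=news"))  # True
```

The last two lines show that a pattern beginning with ? only catches the parameter when it comes first in the query string; a second pattern with & covers the remaining positions.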
After you copy the file
- Upload it to the root of your host (Cloudflare, Vercel, Netlify — anywhere).
- Check the result in the robots.txt report in Google Search Console, which replaced the older robots.txt Tester.
- Re-fetch your sitemap and any newly-blocked pages in Search Console to trigger a re-crawl.
Anything you want private must be behind authentication. Disallow: /secret tells honest bots to skip that URL — but the URL itself is public.
FAQ
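Where do I put robots.txt? At the root of the host, for example https://example.com/robots.txt. A file in a subfolder is ignored, and subdomains do not inherit it: every host needs its own.
Does Google respect Crawl-delay? No. Googlebot ignores the directive, and the crawl rate setting in Search Console was retired in early 2024. To slow Googlebot temporarily, return 500, 503 or 429 responses.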
Can I have multiple sitemap lines? Yes. Each Sitemap: line is treated independently and can point to a sitemap index or individual sitemap.
What about case sensitivity? Rules are case-sensitive for paths but case-insensitive for User-agent and directive names.
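As a small illustration of the per-host rule from the first answer, this Python sketch derives the robots.txt URL that governs any given page. robots_url is a hypothetical helper and the URLs are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?x=1"))  # https://example.com/robots.txt
print(robots_url("https://shop.example.com/cart"))      # https://shop.example.com/robots.txt
```

The subdomain resolves to its own file, which is why each host needs a separate robots.txt.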