Robots.txt Explained: How to Control Search Crawlers
Understand robots.txt syntax, the difference between blocking crawling and blocking indexing, and the common mistakes that accidentally hide your site from Google.
Updated: 24 May 2026 · By Helperzy Team
Robots.txt is one of the most powerful and most misunderstood files in technical SEO. A single wrong line can accidentally hide your entire site from Google, while a missing rule can let crawlers waste time on pages that should never be indexed. The file looks simple, just a few lines of text, but the consequences of getting it wrong are real. This guide explains exactly how robots.txt works, the crucial difference between controlling crawling and controlling indexing, the syntax you need to know, and the mistakes that trip up even experienced site owners.
What Robots.txt Is and How Crawlers Read It
Robots.txt is a plain text file placed at the root of your domain that tells web crawlers which parts of your site they may or may not request. Before a crawler like Googlebot crawls your site, it fetches yourdomain.com/robots.txt (or reuses a recently cached copy) to check the rules.
The file works on a system of trust and cooperation. It follows the Robots Exclusion Protocol, a widely respected standard. Reputable crawlers, including Google, Bing, and major SEO tools, obey it. However, the file has no enforcement power. A poorly behaved or malicious bot can simply ignore it and crawl whatever it wants. This is why robots.txt is never a security tool.
The file must live at the exact root path. For example.com, the only valid location is example.com/robots.txt. Crawlers will not look anywhere else. Importantly, robots.txt rules apply per host, so each combination of protocol, subdomain, and port needs its own file. The robots.txt for example.com does not cover blog.example.com.
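The per-host rule above can be sketched in a few lines of Python. This is a minimal illustration, not a crawler implementation; the function name `robots_url` and the sample URLs are assumptions made for the example.

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url."""
    parts = urlsplit(page_url)
    # Only the scheme and host (plus port, if any) identify the file.
    # The page's own path and query string are discarded.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_url("https://example.com/blog/post?id=7"))
print(robots_url("https://blog.example.com/post"))
print(robots_url("http://example.com/"))
```

Each of the three sample URLs resolves to a different robots.txt file, because the host or the protocol differs.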
If the file returns a 404, crawlers assume everything is allowed. If it returns a server error like 500, Google may temporarily stop crawling the site entirely until it can read the file, which is why server reliability for this file matters.
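As a rough sketch, the status-code behavior described above can be written as a small lookup. The function name and the returned labels are hypothetical, and only the cases mentioned in the text are modeled; other status codes are deliberately left as "unspecified".

```python
def crawl_policy(status_code: int) -> str:
    """Map the robots.txt HTTP status to the crawler behavior described above."""
    if 200 <= status_code < 300:
        return "parse-rules"      # file was served, so read and apply its rules
    if status_code == 404:
        return "allow-all"        # no file: everything may be crawled
    if 500 <= status_code < 600:
        return "pause-crawling"   # server error: Google may stop crawling for now
    return "unspecified"          # not covered by this sketch


for code in (200, 404, 500):
    print(code, crawl_policy(code))
```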
Understanding the Core Directives
Robots.txt uses a small set of directives, grouped into blocks that apply to specific crawlers.
User-agent: This names the crawler a block of rules applies to. 'User-agent: *' targets all crawlers. 'User-agent: Googlebot' targets only Google's main crawler. You can have multiple blocks for different crawlers with different rules.
Disallow: This tells the named crawler not to request URLs that start with the given path. 'Disallow: /admin/' blocks everything under the admin directory. 'Disallow: /' blocks the entire site, which is a common accidental disaster. An empty 'Disallow:' value means nothing is blocked.
Allow: This creates an exception to a Disallow rule. If you block /folder/ but want /folder/public-page.html crawled, you add an Allow rule for that specific path. When an Allow and a Disallow rule both match a URL, Google applies the most specific one, meaning the rule with the longest matching path.
Sitemap: This points crawlers to your XML sitemap location, using a full absolute URL. It is independent of user-agent blocks and can appear anywhere in the file.
A simple example: a block with 'User-agent: *' followed by 'Disallow: /cart/' and 'Disallow: /checkout/' tells all crawlers to skip the shopping cart and checkout pages while leaving the rest of the site open.
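To make the directives concrete, here is a minimal Python sketch of how one user-agent group might be evaluated. It assumes plain prefix matching, picks the longest matching rule, and lets Allow win a tie. The function `is_allowed` and the rule lists are illustrative only and do not handle wildcards or multiple user-agent groups.

```python
def is_allowed(rules, path):
    """rules: list of (directive, path_prefix) pairs for one user-agent group."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix:
            continue  # an empty value (e.g. "Disallow:") blocks nothing
        if path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            # The longest matching prefix wins; on a tie, Allow wins.
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len = len(prefix)
                allowed = is_allow
    return allowed


# The cart and checkout example from the text.
shop_rules = [("Disallow", "/cart/"), ("Disallow", "/checkout/")]
# The Allow exception from the text.
folder_rules = [("Disallow", "/folder/"), ("Allow", "/folder/public-page.html")]

print(is_allowed(shop_rules, "/cart/items"))                 # False
print(is_allowed(shop_rules, "/blog/post"))                  # True
print(is_allowed(folder_rules, "/folder/secret.html"))       # False
print(is_allowed(folder_rules, "/folder/public-page.html"))  # True
```

Python's standard `urllib.robotparser` module can parse real files, but it applies rules in file order rather than by longest match, so a hand-rolled check like this is closer to the behavior described for Google.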
Crawling vs Indexing: The Critical Distinction
This is the single most important concept in robots.txt, and getting it wrong causes serious SEO problems.
Crawling is the act of a search engine requesting and downloading a page. Indexing is the act of storing that page in the search engine's database so it can appear in results. They are separate steps, and robots.txt only controls the first one.
When you disallow a URL in robots.txt, you stop Google from crawling it. But here is the catch: Google can still index a URL it has never crawled. If other websites link to that blocked URL, Google knows the URL exists and may show it in search results, usually with no description because it could not read the content. The result is an ugly, contentless listing you did not want.
To truly keep a page out of search results, you need the noindex directive, which lives in a meta robots tag on the page itself or in an HTTP header, not in robots.txt. And critically, the page must be crawlable for Google to see the noindex tag. If you block the page in robots.txt, Google never crawls it, never sees the noindex, and may index the URL anyway.
The rule of thumb: use robots.txt to manage crawling and crawl budget. Use noindex to manage what appears in search results. Never combine a robots.txt block with a noindex tag on the same page, because the block prevents the noindex from working.
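The interaction between a robots.txt block and a noindex tag can be summarized as a small decision function. This is a simplified model of the reasoning above, not a description of Google's internals; the function name and the outcome strings are made up for illustration.

```python
# The two standard ways to send noindex, shown here for reference only.
NOINDEX_META = '<meta name="robots" content="noindex">'
NOINDEX_HEADER = ("X-Robots-Tag", "noindex")


def search_outcome(blocked_in_robots: bool, has_noindex: bool,
                   linked_from_elsewhere: bool) -> str:
    """Simplified model: what can happen to a URL in search results."""
    if blocked_in_robots:
        # The page is never crawled, so any noindex on it is never seen.
        if linked_from_elsewhere:
            return "may appear as a URL-only listing"
        return "not crawled"
    if has_noindex:
        return "crawled and kept out of results"
    return "crawled and eligible for indexing"


# Blocking plus noindex: the noindex is ignored because it is never read.
print(search_outcome(blocked_in_robots=True, has_noindex=True, linked_from_elsewhere=True))
# Crawlable plus noindex: the reliable way to stay out of results.
print(search_outcome(blocked_in_robots=False, has_noindex=True, linked_from_elsewhere=True))
```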
Practical Patterns and Wildcards
Beyond basic paths, robots.txt supports two special characters that Google and most major crawlers understand.
The asterisk (*) acts as a wildcard matching any sequence of characters. 'Disallow: /*?' blocks any URL containing a question mark, which is useful for blocking URLs with query parameters. 'Disallow: /*.pdf$' blocks any URL that ends in .pdf.
The dollar sign ($) marks the end of a URL. 'Disallow: /*.php$' blocks URLs ending in .php but not URLs that merely contain .php elsewhere in the path.
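One way to reason about these two characters is to translate a robots.txt pattern into a regular expression. The sketch below is an assumption about how such matching can be implemented, not any crawler's actual code; `rule_matches` is a hypothetical helper.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path (plus query)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces; each * becomes "any run of characters".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        # $ means the URL must end exactly where the pattern ends.
        return re.fullmatch(regex, path) is not None
    # Without $, the pattern only has to match from the start of the path.
    return re.match(regex, path) is not None


print(rule_matches("/*?", "/products?sort=price"))  # True
print(rule_matches("/*?", "/products"))             # False
print(rule_matches("/*.php$", "/index.php"))        # True
print(rule_matches("/*.php$", "/index.php?x=1"))    # False
```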
Common real-world patterns include blocking internal search result pages with 'Disallow: /search', blocking faceted navigation parameters that create duplicate content, and blocking admin or login areas. On large e-commerce sites, blocking filter and sort parameter URLs can dramatically reduce wasted crawling.
A word of caution on wildcards: they are powerful and easy to over-apply. A rule like 'Disallow: /*?' will block every URL with a query string, which might include legitimate paginated pages or tracked landing pages you actually want crawled. Always test patterns carefully against real URLs before deploying them. Even a small mistake can remove important sections from crawling, and the effects may take weeks to fully appear in your indexing reports.
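Testing patterns against real URLs can be partly automated. The following sketch takes a list of Disallow patterns and a list of URLs you want crawled, and reports any URL that a pattern would block. The helper names and sample URLs are hypothetical, and the check ignores Allow rules and user-agent groups, so treat it as a first-pass audit rather than a full validator.

```python
import re
from urllib.parse import urlsplit


def rule_matches(pattern: str, path: str) -> bool:
    """Match a robots.txt pattern (with * and $) against a path plus query."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None


def audit(disallow_patterns, must_crawl_urls):
    """Return {url: [patterns that block it]} for every URL that would be blocked."""
    problems = {}
    for url in must_crawl_urls:
        parts = urlsplit(url)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        hits = [p for p in disallow_patterns if rule_matches(p, path)]
        if hits:
            problems[url] = hits
    return problems


important = [
    "https://example.com/blog?page=2",
    "https://example.com/about",
    "https://example.com/landing?utm_source=news",
]
for url, patterns in audit(["/*?", "/search"], important).items():
    print(url, "is blocked by", patterns)
```

Here the broad '/*?' rule catches both the paginated blog page and the tracked landing page, which is exactly the kind of over-blocking the caution above describes.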
Common Robots.txt Mistakes
Blocking the whole site by accident: The line 'Disallow: /' under 'User-agent: *' blocks every page on your site. This frequently happens when a staging site's robots.txt gets pushed to production. Always check this first if traffic suddenly disappears.
Using robots.txt to deindex pages: As covered above, blocking a page does not remove it from search and may leave an ugly URL-only listing. Use noindex instead.
Blocking CSS and JavaScript: Years ago people blocked these to save crawl budget. Today Google renders pages like a browser and needs your CSS and JS to understand layout and content. Blocking them can hurt how Google evaluates your pages. Leave resource files crawlable.
Case sensitivity errors: Paths in robots.txt are case sensitive. 'Disallow: /Folder/' does not block '/folder/'. Match the exact casing your URLs use.
Forgetting the file is public: Anyone can read yourdomain.com/robots.txt. Do not list sensitive directories, because you are literally publishing a map of where they are.
Trailing slash confusion: Rules match by prefix. 'Disallow: /private' blocks the /private page, everything under /private/, and any other path that starts with /private, such as /private-notes.html. 'Disallow: /private/' blocks only the directory contents. Know which behavior you want.
Not testing changes: Before deploying, use a robots.txt tester to confirm your rules block what you intend and nothing more. A robots.txt validator catches syntax errors and shows which URLs each rule affects.
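Several of the mistakes above can be caught with simple checks before deployment. The sketch below illustrates case-sensitive prefix matching and uses Python's standard `urllib.robotparser` to detect a file that blocks the whole site. The function names and the `ExampleBot` user agent are assumptions for the example, and the standard-library parser does not implement wildcards.

```python
from urllib.robotparser import RobotFileParser


def blocks(rule_path: str, url_path: str) -> bool:
    """Plain prefix match, case sensitive, as in the list above."""
    return url_path.startswith(rule_path)


def blocks_whole_site(robots_txt: str, agent: str = "ExampleBot") -> bool:
    """Return True if the homepage itself is disallowed for this agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


# Trailing slash and case sensitivity.
print(blocks("/private", "/private-notes.html"))  # True: prefix match
print(blocks("/private/", "/private"))            # False: the page itself is not covered
print(blocks("/Folder/", "/folder/page"))         # False: casing differs

# A staging file that must never reach production.
staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: *\nDisallow: /admin/\n"
print(blocks_whole_site(staging))     # True
print(blocks_whole_site(production))  # False
```

A check like `blocks_whole_site` could run in a deployment pipeline to stop a staging file from being published.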
Building a Sensible Robots.txt File
For most sites, a good robots.txt is short and conservative. Start by allowing everything and only block what genuinely needs blocking. The safest default is no restrictions at all, with a sitemap reference added.
A typical setup for a content site might allow all crawlers full access, block administrative and account areas, block internal search result pages to avoid indexing low-value combinations, and point to the XML sitemap. That is usually enough.
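A file like the one described above can be generated from a short list of paths. This is a minimal sketch; the paths `/admin/`, `/account/`, and `/search` and the sitemap URL are placeholder assumptions, not recommendations for any particular site.

```python
def build_robots_txt(disallow_paths, sitemap_url):
    """Build a small robots.txt: one group for all crawlers plus a sitemap line."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in disallow_paths]
    # The Sitemap line is independent of the user-agent group above it.
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


content = build_robots_txt(
    ["/admin/", "/account/", "/search"],
    "https://example.com/sitemap.xml",
)
print(content)
```

Generating the file from a reviewed list keeps it short and makes every blocked path an explicit decision.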
For AI crawler management, robots.txt is also where you decide whether to allow or block AI training and search crawlers, using user-agent tokens such as GPTBot, Google-Extended, and PerplexityBot. Many site owners allow AI search crawlers because they can drive referral traffic through citations, while blocking pure scrapers that offer no traffic in return. This is a strategic choice based on your goals.
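Per-crawler rules use separate user-agent groups. The sketch below checks such a file with Python's standard `urllib.robotparser`. Which crawler is blocked here is an arbitrary choice made for illustration, not a recommendation, and the page URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block one AI crawler, leave everything else open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, url: str = "https://example.com/articles/guide") -> bool:
    """Return True if the named crawler may fetch the URL under ROBOTS_TXT."""
    return parser.can_fetch(agent, url)


for agent in ("GPTBot", "PerplexityBot", "Googlebot"):
    print(agent, "allowed" if allowed(agent) else "blocked")
```

Crawlers without a group of their own fall back to the 'User-agent: *' group, which in this file blocks nothing.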
After writing your file, validate it, place it at your domain root, and reference your sitemap inside it. Then monitor Google Search Console for any crawl anomalies over the following weeks. If you see important pages dropping out of the index, revisit your rules immediately.
Keep the file under version control alongside your code so you can track changes and quickly roll back if a bad rule slips through. Small, deliberate edits beat sweeping changes you cannot easily undo.
Robots.txt gives you precise control over how crawlers access your site, but only if you respect the line between crawling and indexing. Use it to guide crawling and protect crawl budget, never to hide sensitive data or deindex pages. Keep the file at your domain root, write conservative rules, leave CSS and JavaScript crawlable, and always test before deploying. When paired with noindex tags for index control and an XML sitemap for discovery, a clean robots.txt becomes a reliable foundation for technical SEO.
Frequently Asked Questions
Does blocking a page in robots.txt remove it from Google?
No, and this is the most common misunderstanding. Blocking a URL in robots.txt stops Google from crawling it, but the URL can still appear in search results if other pages link to it, often with no description. To keep a page out of the index, allow crawling and use a noindex meta tag, or protect it with authentication. Blocking crawling is counterproductive here because it prevents Google from seeing the noindex tag.
Where does the robots.txt file need to be located?
It must sit at the root of your domain, accessible at yourdomain.com/robots.txt. Crawlers only check that exact location. A robots.txt file in a subdirectory like yourdomain.com/blog/robots.txt is ignored. Each subdomain needs its own file, so blog.example.com and shop.example.com require separate robots.txt files even though they share a parent domain.
Is robots.txt a security measure?
No. Robots.txt is a public file anyone can read, and it only requests cooperation from well-behaved crawlers. Listing a sensitive directory in robots.txt actually advertises its existence to anyone curious. Malicious bots ignore the file entirely. For real protection, use server-side authentication, password protection, or proper access controls. Never rely on robots.txt to hide private or confidential content.
What happens if I do not have a robots.txt file?
If no robots.txt file exists, crawlers assume they are allowed to crawl everything publicly accessible. For many sites that is perfectly fine. A robots.txt file is only necessary when you want to restrict crawling of certain areas, manage crawl budget on a large site, or point crawlers to your sitemap. A missing file is not an error, and an empty file is treated the same way, but a misconfigured one can cause problems.
Can I use robots.txt to manage crawl budget?
Yes, on large sites this is a valid use. By disallowing low-value URLs like internal search results, faceted navigation parameters, or duplicate filter combinations, you stop crawlers from wasting time on pages that should not be indexed. This frees crawl budget for your important pages. For small sites with a few hundred pages, crawl budget is rarely a concern and this optimization is unnecessary.