
What is Robots.txt and Why It’s Probably the Reason You Don’t Show Up on AI Search

Let’s be honest: AI search is changing how people find content. ChatGPT, Claude, Google’s AI Overviews—these aren’t just nice-to-haves anymore. They’re becoming the default way people get answers. I use them every day and have both ChatGPT Plus and Claude Pro subscriptions. Not proud of it, but these chatbots certainly make finding answers easier and faster.

So, as a content creator, if your content isn’t showing up in these AI-powered results, you’ll be missing out on a significant portion of traffic. That’s why SEOs are so focused on optimizing for AI search. But what if you’ve already done the work and still can’t see the results?

Sure, SEO takes time, but what if there’s something else altogether? What if it’s a tiny text file sitting on your website that’s blocking AI crawlers from ever seeing your content in the first place?

Yeah—sounds weird, but it can happen. Happened in my workplace a few months ago, actually. And that file? Well, it’s called robots.txt. And if you set it up years ago (or inherited it from a developer who did), there’s a good chance it’s blocking the exact crawlers that power AI search—without you even realizing it.

And that’s why in this post, we’re taking a look at what robots.txt actually is, how AI crawlers work differently from traditional search engines, and how to check if you’re accidentally making yourself invisible to the future of search. Plus, I’ll show you exactly how to fix it (or intentionally block AI crawlers if that’s your strategy). So, let’s begin.


What is Robots.txt?

Robots.txt is a text file that tells search engines and bots what they can and can’t access on your site. It lives at the root of your domain—yourdomain.com/robots.txt—and acts like a set of instructions for crawlers.

When a bot visits your site, it checks this file first to see what you’ve allowed or blocked. If you’ve told it to stay out of certain pages or folders, it will (usually). If you’ve given it the green light, it crawls and indexes your content.

Here’s what a basic robots.txt file looks like:

User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: Googlebot
Allow: /

What This Means

  • User-agent: * means “all bots”
  • Disallow: /admin/ tells bots not to crawl anything in the /admin/ folder
  • User-agent: Googlebot gives specific instructions just for Google’s crawler
  • Allow: / lets Googlebot access everything

Simple enough, right?

The problem is, robots.txt was created back in 1994, way before AI search was a thing. It was designed to manage traditional search engines like Google and Bing, not AI crawlers like GPTBot (ChatGPT), Google-Extended, or anthropic-ai (Claude).

And if your robots.txt only allows old-school crawlers? You might be blocking AI search tools from ever seeing your content. Google’s official robots.txt documentation explains the technical details, but here’s what you actually need to know: this tiny file could be why you’re invisible to AI search.

How AI Search Crawlers Are Different from Traditional Search Engines

Traditional search engine crawlers like Googlebot and Bingbot have one job: index your content so it shows up in search results. They visit your site, read your pages, and add them to their database. When someone searches for something, your page might rank if it’s relevant.

AI search crawlers? They work differently.

Crawlers like GPTBot (OpenAI), Google-Extended (Google’s AI training bot), CCBot (Common Crawl), and anthropic-ai (Claude) don’t just index your content—they scrape it to train AI models and generate answers. When someone asks ChatGPT or Claude a question, these tools pull from the data their crawlers have collected to give direct answers, summaries, or references.

Here’s the key difference:

  • Traditional crawlers → Index your page → User searches → Your page appears in results → User clicks through to your site
  • AI crawlers → Scrape your content → User asks AI a question → AI generates an answer (sometimes citing you, sometimes not) → User might never visit your site

Why Your Site Might Be Invisible to AI Search

If you’re not showing up in AI search results, here are the most common reasons why.

1. Default blocking by your CMS or security plugins

A lot of platforms and plugins block unknown bots automatically to protect against spam and malicious crawlers. It all makes sense in theory. But GPTBot, Google-Extended, and other AI crawlers? They’re “unknown” to older systems, so they get blocked by default.

WordPress security plugins like Wordfence or Sucuri, for example, can block bots that aren’t on their approved list. Same with some hosting providers. You didn’t intentionally block AI crawlers—your setup just doesn’t recognize them as legitimate.

2. Overly restrictive robots.txt rules

Some sites use blanket blocking rules like this:

User-agent: *
Disallow: /

This tells all bots to stay out. If you’ve got this in your robots.txt (or something similar), no crawler is accessing your site unless you’ve specifically allowed it elsewhere.

Or you might have rules that only allow a few known crawlers:

User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /

This lets Google and Bing in but blocks everything else. Including GPTBot, anthropic-ai, and every other AI crawler.

3. Outdated robots.txt files

If your robots.txt was set up in 2019 (or earlier), it doesn’t account for AI crawlers that didn’t exist yet. You’re not blocking them intentionally—you just never updated your file to include them.

4. Conflicting directives

Sometimes you’ve got multiple rules that contradict each other, or plugins adding their own robots.txt rules on top of what you manually set. This can accidentally block crawlers you meant to allow.

For example, I found this exact issue at my workplace a few months ago. We were optimizing content for AI search, doing everything right—but our robots.txt had rules that blocked anything that wasn’t Googlebot or Bingbot. We were invisible to ChatGPT, Claude, all of it. One update to the file and suddenly our content started showing up in AI search results.

Basically, if you’ve never checked your robots.txt or don’t remember the last time it was updated, there’s a very good chance you’re blocking AI crawlers without realizing it.

Common AI Crawlers You Should Know About 

Not all AI crawlers are used for the same purpose. Some are used for training models, others power search features, and some do both. Here’s a list of the main ones you’ll encounter:

1. GPTBot (OpenAI/ChatGPT)

This is OpenAI’s web crawler. It scrapes content to improve ChatGPT and train future models. If you block GPTBot, your content won’t show up in ChatGPT search results or be used to train their models.

2. Google-Extended (Google’s AI training crawler)

Separate from Googlebot, this crawler collects data specifically for Google’s AI products like Bard (now Gemini) and AI Overviews. Blocking this won’t affect your regular Google search rankings, but it will keep your content out of Google’s AI features.

3. CCBot (Common Crawl)

Common Crawl is a nonprofit that builds a massive, publicly accessible web archive. A ton of AI models—including many open-source ones—train on Common Crawl data. Block CCBot and you’re blocking a huge chunk of AI training sources.

4. Anthropic-ai (Claude)

This is Anthropic’s crawler for Claude. If you block it, your content won’t be referenced in Claude’s responses or used for training.

5. Bytespider (TikTok/ByteDance)

ByteDance’s crawler, used for their AI products and search features. Less talked about but increasingly relevant as TikTok expands its search capabilities.

6. Applebot-Extended (Apple Intelligence)

Apple’s AI training crawler. Separate from their regular Applebot (which powers Siri and Spotlight). Blocking this keeps your content out of Apple’s AI ecosystem.

Why you might want to allow them:

  • Your content gets referenced in AI search results
  • Increases discoverability as more people use AI tools to find information
  • Positions you as a resource in AI-powered platforms

Why you might want to block them:

  • Protect original content from being scraped for AI training
  • Maintain competitive advantage (especially if you’re creating premium or proprietary content)
  • Philosophical reasons—you don’t want your work feeding AI models without compensation

The choice is yours. But here’s the thing: you need to actively decide which crawlers to allow or block. Letting outdated robots.txt rules make that decision for you isn’t a strategy—it’s just leaving traffic on the table.

How to Check If You’re Blocking AI Crawlers 

Now that you know all about robots.txt, let’s see if yours is the problem. This takes about 5 minutes, and you don’t need to be a developer to do it.

Step 1: Find your robots.txt file

Go to your website and add /robots.txt to the end of your domain. For example:

  • yourdomain.com/robots.txt
  • athistleinthewind.com/robots.txt

If the file exists, you’ll see a plain text page with rules. If you get a 404 error, you don’t have a robots.txt file (which means you’re not blocking anything but you’re also not controlling what gets crawled).

Step 2: Look for these user-agents

Scan your robots.txt for these names:

  • GPTBot
  • Google-Extended
  • CCBot
  • anthropic-ai
  • Bytespider
  • Applebot-Extended

If you see any of these followed by “Disallow,” you’re blocking that specific crawler.

For example:

User-agent: GPTBot
Disallow: /

This blocks GPTBot completely.

Step 3: Check for “Disallow” directives 

Look for this:

User-agent: *
Disallow: /

The asterisk (*) means “all bots.” If you’ve got “Disallow: /” under it, you’re blocking every crawler unless you’ve explicitly allowed specific ones in their own user-agent groups.

Also watch for rules like this:

User-agent: *
Disallow: /blog/

This blocks all bots from your /blog/ folder—including AI crawlers. If your main content lives in /blog/, that’s a problem.

Step 4: Use a robots.txt tester tool (Google Search Console or online checker)

Google Search Console has a robots.txt report (under Settings) that shows the robots.txt files Google has found and flags any errors or warnings. There are also free online robots.txt checkers where you can plug in your URL and a specific user-agent to test individual rules.
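
If you’d rather check this from a script, here’s a minimal Python sketch using the standard library’s urllib.robotparser. It downloads your robots.txt and reports whether each of the AI user-agents above may crawl your homepage. Swap in your own domain; robotparser’s matching is simpler than what real crawlers do, so treat this as a quick sanity check, not a guarantee.

from urllib.robotparser import RobotFileParser

SITE = "https://yourdomain.com"  # swap in your own domain

AI_BOTS = [
    "GPTBot",
    "Google-Extended",
    "CCBot",
    "anthropic-ai",
    "Bytespider",
    "Applebot-Extended",
]

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetches and parses the robots.txt file

for bot in AI_BOTS:
    allowed = parser.can_fetch(bot, f"{SITE}/")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")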

Common mistakes people make:

  • Blocking entire sections without realizing it. You meant to block /admin/ but accidentally wrote Disallow: /ad (no trailing slash), which also blocks /advertising/ and /advisors/.
  • Multiple conflicting rules. One part of your robots.txt allows a bot, another part blocks it. Crawlers don’t all resolve conflicts the same way, so the result can be the opposite of what you intended.
  • Plugins overriding your manual settings. You updated robots.txt directly, but a security plugin is adding its own rules on top of yours.

If you find that you’re blocking AI crawlers and you want to allow them, we’ll fix that in the next section. If you’re blocking them intentionally, at least now you know it’s working.

How to Update Your Robots.txt to Allow (or Block) AI Crawlers

Now that you know what’s in your robots.txt, here’s how to fix it.

If you want to ALLOW AI crawlers:

Add specific user-agent permissions to your robots.txt. Here’s what that looks like:

User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Applebot-Extended
Allow: /

This explicitly tells each AI crawler it’s allowed to access your entire site. If you’ve got a blanket blocking rule (like User-agent: * with Disallow: /), these named groups still win for their bots: a crawler follows the most specific user-agent group that matches it, so GPTBot will use its own group instead of the wildcard one.

Important: Make sure you don’t have conflicting rules. If GPTBot is allowed in one group and disallowed in another, different crawlers can combine or interpret those duplicate groups in different ways, so keep a single, unambiguous group for each user-agent.
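
To make that concrete, here’s a rough sketch of a combined file that lets a couple of AI crawlers in while keeping the wildcard block for everything else:

User-agent: GPTBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: *
Disallow: /

GPTBot and anthropic-ai follow their own groups here; every other bot falls back to the wildcard group and stays out.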

If you want to BLOCK AI crawlers:

Maybe you don’t want AI scraping your content for training. That’s fair. Here’s how to block specific crawlers:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

This keeps those specific bots out while still allowing traditional search engines like Google and Bing to crawl your site normally.

Why you might want to block AI crawlers:

  • You’re creating premium or proprietary content and don’t want it used for AI training
  • You want to maintain competitive advantage (your analysis, research, unique insights)
  • Philosophical stance—you don’t think AI companies should scrape content without compensation

Just know that blocking AI crawlers means your content won’t show up in AI search results. You’re trading discoverability for control.

How to Actually Update Your Robots.txt

Now, once you’ve checked everything and decided whether to let AI bots crawl your content, here’s how to update your robots.txt. There are several ways to do this, so choose whichever is easiest for you.

Option 1: Edit it directly via FTP or your hosting control panel

  • Log into your hosting account (cPanel, Plesk, etc.)
  • Navigate to your site’s root directory (usually public_html or www)
  • Find robots.txt and edit it
  • Save and upload

Option 2: Use a plugin (WordPress)

  • Install a plugin like Yoast SEO or Rank Math
  • Go to Tools → File Editor (Yoast) or General Settings → Edit robots.txt (Rank Math)
  • Add your rules and save

Option 3: Let your CMS handle it

Some platforms (Shopify, Squarespace, Wix) auto-generate robots.txt. You might need to adjust settings in your SEO or site settings panel instead of editing the file directly.

After you update, test it:

  • Go to yourdomain.com/robots.txt and make sure your changes are live
  • Use Google’s robots.txt tester or another checker to verify
  • Monitor your analytics over the next few weeks to see if AI search traffic changes

One last thing: Robots.txt is a request, not a wall. Well-behaved crawlers will respect it, but not all bots do. If you need stricter control, consider adding meta tags (<meta name="robots" content="noindex">) or using your server’s .htaccess file to block specific user agents entirely.
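
If you do go the .htaccess route on an Apache server, a user-agent block could look something like this rough sketch (it assumes mod_rewrite is enabled; adjust the bot names to whichever crawlers you want to turn away):

# Return 403 Forbidden to any request whose User-Agent mentions GPTBot or CCBot
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot) [NC]
RewriteRule .* - [F,L]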

But for most people? Updating robots.txt is enough.

What Happens If You Don’t Fix This?

Let’s say you decide not to update your robots.txt. Maybe it feels like too much work, or you’re not convinced AI search matters yet. Here’s what you’re actually risking:

1. You miss out on traffic from AI search tools

More people are using ChatGPT, Perplexity, Claude, and Google’s AI Overviews to find information instead of traditional search. If your content isn’t accessible to these tools, you’re invisible to that entire audience.

And this isn’t a small group. ChatGPT alone has over 100 million weekly active users. Perplexity is growing fast. Google’s pushing AI Overviews harder every month. These aren’t experimental features anymore—they’re how people search now.

2. Your content won’t be referenced or cited

Even if someone asks a question your content answers perfectly, AI tools can’t reference you if they’ve never crawled your site. That means:

  • No brand visibility in AI-generated answers
  • No traffic from users who want to learn more
  • No backlinks or mentions that come from being cited as a source

Your competitors who do allow AI crawlers? They’re getting that visibility instead.

3. The discoverability gap widens

SEO has always been about being where your audience is looking. Right now, that’s shifting toward AI search. If you’re not optimizing for it—or worse, if you’re accidentally blocking it—you’re falling behind.

Think about it: Traditional Google search is getting more competitive and harder to rank in. AI search is still relatively new, and early movers have an advantage. If you wait another year to fix this, you’ll be playing catch-up while others have already established themselves as trusted sources in AI platforms.

4. You’re making decisions by default, not strategy

Here’s the bigger issue: If you’re blocking AI crawlers because of an outdated robots.txt file you set up in 2019, that’s not a strategy. That’s just inertia.

Maybe you should block AI crawlers. Maybe you’ve got good reasons to protect your content. But that should be an intentional choice, not something that happened by accident because you never updated a text file.

At least check. At least know what you’re blocking and why. Then decide if that aligns with your goals.

Because the reality is, AI search isn’t going away. It’s only getting bigger. And if you’re serious about content marketing, SEO, or building an audience online, you need to decide how you’re going to show up in this new landscape.

I don’t like the fact that AI companies have scraped our content without consent. But in the coming years, I’m pretty confident that amid the onslaught of AI slop online, people are going to value actual human voices again, so I think it’s worth it. Of course, I’d prefer our regulators actually did something about it, but until then, this is what we’re dealing with.

Should You Even Allow AI Crawlers? (The Debate)

This isn’t a black-and-white decision, and reasonable people land on different sides.

The Case for Allowing AI Crawlers

  • Discoverability: If people are using AI tools to find answers and your content isn’t there, you’re missing a massive audience. Being cited in ChatGPT or Perplexity can drive traffic, build authority, and position you as a go-to source in your niche.
  • SEO evolution: Search behavior is changing. Optimizing for AI search now puts you ahead of people who wait until it’s saturated. Early adopters get referenced more, rank better in AI results, and build momentum while competition is still low.
  • Brand visibility: Even if someone doesn’t click through, being cited by AI tools gets your name in front of people. That’s brand awareness you wouldn’t have otherwise.

The Case for Blocking AI Crawlers

  • Content protection: If you’re creating original research, proprietary analysis, or premium content, you might not want it scraped to train models that could eventually compete with you. Blocking AI crawlers keeps your work from being commoditized.
  • No compensation: AI companies are using your content to build billion-dollar products, and you’re not seeing a cent. Some creators see this as exploitation and refuse to participate until there’s fair compensation. Organizations like the News Media Alliance and Authors Guild have been pushing for licensing agreements and legal protections.
  • Competitive advantage: If your edge is deep, unique expertise, letting AI scrape and regurgitate it dilutes that advantage. Keeping it gated (via paywalls, email sign-ups, or blocking crawlers) maintains exclusivity.

The Middle Ground

You don’t have to choose all or nothing. Selective blocking lets you control what gets scraped (there’s a short sketch after this list):

  • Allow crawlers to access educational or top-of-funnel content
  • Block premium resources, proprietary data, or monetized content
  • Let search-focused crawlers in (GPTBot, anthropic-ai) but block training-focused ones (CCBot)
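
Here’s what a selective setup could look like (the /premium/ path is just a placeholder for wherever your gated content lives):

User-agent: GPTBot
Disallow: /premium/

User-agent: anthropic-ai
Disallow: /premium/

User-agent: CCBot
Disallow: /

In this sketch, GPTBot and anthropic-ai can crawl everything except /premium/, while CCBot is blocked entirely.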

Some publishers are also experimenting with licensing deals. OpenAI has struck agreements with outlets like Axel Springer to use their content in exchange for payment. If you’ve got high-value content, that might be an option down the line.

Ultimately, it comes down to what you’re trying to achieve. If discoverability and traffic matter more than control, allow crawlers. If protecting your work is the priority, block them. Just make sure it’s a decision you’re making intentionally, not one your robots.txt is making for you.
