SEO in the Age of LLMs and Agentic AI
By JJ
For about twenty years "SEO" mostly meant one thing: get a URL into a search index, then rank it. That mental model is now incomplete. A modern page can be (a) indexed and ranked the old way, (b) cited as a source inside an AI-generated answer, (c) silently absorbed into a training set, or (d) fetched on demand by an agent acting on a user's request. Those are four different pipelines with four different control surfaces.
This post is a tour of how the surface has changed, and how I think about the controls that actually exist.
The three-plus-one control planes
If you are responsible for a website in 2026, it helps to mentally separate four things every time you make a publishing decision:
- Classic indexing — am I in the index, and how do I rank?
- AI answer inclusion — when an LLM answers a user, am I being cited or summarized?
- Training inclusion — is my content sitting inside a model's weights for future generations?
- Agentic fetch — when a user tells an assistant "go read this page for me," is it allowed to?
These look similar on the surface, but the providers expose different knobs for each one. Google ties AI Overviews and AI Mode tightly to its existing search index — to be cited in those features your page must already be indexed and snippet-eligible. OpenAI splits its bots into OAI-SearchBot (powering ChatGPT Search), GPTBot (training), and ChatGPT-User (user-initiated fetches). Anthropic does the same split: Claude-SearchBot, ClaudeBot, and Claude-User. Bing/Microsoft offers a particularly strong distinction via NOARCHIVE, which can keep a page in classic search while excluding it from Bing Chat and Copilot summaries.
The takeaway is not that you need to learn ten new tags. The takeaway is that "block AI" and "block search" are no longer the same lever, and treating them as one will either over-restrict your reach or under-protect your content.
robots.txt is a traffic sign, not a fence
This one trips up almost everyone. robots.txt is a request to well-behaved crawlers about which paths to fetch. It is not a guarantee that a URL stays out of any result. Google says this explicitly: a robots.txt disallow does not reliably keep a page out of Search. OpenAI says the same thing — a page they cannot fetch can still appear as a bare link and title if they learn the URL from somewhere else. Brave says even more bluntly that robots.txt is not used to prevent indexing.
If you actually want a page suppressed from a result set, the mechanism is noindex (as a meta tag or X-Robots-Tag header), not robots.txt. And here is the catch that bites people: for a crawler to honor noindex, it has to be allowed to fetch the page. Disallowing it in robots.txt blinds the very crawler you wanted to obey your instruction.
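The conflict described above can be checked mechanically. The sketch below uses Python's standard `urllib.robotparser` to flag URLs that carry noindex but are disallowed for a given crawler. The function name, the sample rules, and the example.com URLs are made up for illustration, and the standard-library parser only approximates how any particular crawler reads robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration: /drafts/ is blocked for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
"""


def blinded_noindex_urls(robots_txt, noindex_urls, user_agent="Googlebot"):
    """Return the noindex URLs that `user_agent` is not allowed to fetch.

    A crawler that cannot fetch a page never sees its noindex meta tag or
    X-Robots-Tag header, so the instruction on these URLs goes unread.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


# Pages assumed to send "X-Robots-Tag: noindex" or a robots meta tag.
noindex_pages = [
    "https://example.com/drafts/old-launch-post",
    "https://example.com/thank-you",
]

conflicts = blinded_noindex_urls(ROBOTS_TXT, noindex_pages)
print(conflicts)
```

Any URL this returns needs one of the two rules relaxed: either lift the disallow so the crawler can read the noindex, or accept that the URL may still surface as a bare link.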
The new crawler taxonomy worth memorizing
If you are going to learn one new piece of vocabulary this year, learn the difference between search bots, training bots, and user-triggered fetchers. Some providers run one of each.
| Provider | Search-quality bot | Training bot | User-initiated fetcher |
|---|---|---|---|
| OpenAI | OAI-SearchBot | GPTBot | ChatGPT-User |
| Anthropic | Claude-SearchBot | ClaudeBot | Claude-User |
| Google | (Googlebot, shared) | Google-Extended (a robots.txt token, not a separate crawler; Gemini training/grounding) | (n/a as separate UA) |
A few things to notice:
- The search-quality bots are how your content shows up cited inside an AI answer. If you want that visibility, you almost always want to allow them.
- The training bots are how your content ends up baked into a future model. This is the lever for "I'm fine being read live, but I do not want to be a free training corpus."
- The user-initiated fetchers are the agentic case. A user types "summarize this URL." The robot is acting on a person's instruction, and providers like OpenAI have stated that robots.txt may not apply to these fetches because they are not crawls. That blurs an old line. If you absolutely need to stop a user from feeding your URL into an assistant, the answer is authentication, not a robots rule.
Notably, Google-Extended does not affect Search inclusion. You can block it to keep your text out of Gemini's grounding and training while remaining fully visible in Google Search. That kind of "split allow" was not a thing five years ago, and it is the cleanest example of how the controls have multiplied.
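As a rough illustration of a "split allow", here is a robots.txt that opts out of the training tokens named in the table while leaving the search bots alone, checked with Python's standard `urllib.robotparser`. The policy itself and the example.com URL are assumptions for the sketch, not a recommendation, and each provider's own parser may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of training, stay open to search and citation.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

PAGE = "https://example.com/notes/my-note"
TOKENS = [
    "Googlebot", "OAI-SearchBot", "Claude-SearchBot",  # search and citation
    "GPTBot", "ClaudeBot", "Google-Extended",          # training
]

# Map each token to True (may fetch) or False (blocked) under this policy.
policy = {token: parser.can_fetch(token, PAGE) for token in TOKENS}
for token, allowed in policy.items():
    print(f"{token:<18} {'allow' if allowed else 'block'}")
```

The user-initiated fetchers are left out on purpose: as noted above, a robots rule may not govern them at all.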
The classic fundamentals did not retire
Here is the part that quietly upset a lot of "AI-first SEO" predictions: the fundamentals still carry the freight.
Google has repeatedly said that AI Overviews and AI Mode have no special markup requirements. There is no <meta name="ai-overview">. The page needs to be indexed, eligible for a snippet, semantically clean, and accurate to its visible content. Structured data still helps. Internal linking still helps. A real <h1> and an honest <title> still help.
So if you were already doing the boring, careful work — one canonical URL per concept, a real XML sitemap of preferred URLs, Article or BlogPosting JSON-LD on long-form posts, semantic HTML, alt text on images, accurate descriptions — you are mostly already doing GEO ("generative engine optimization") too. The new layer sits on top.
What the new layer adds is a measurement and citation surface.
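For the JSON-LD piece of that boring, careful work, a minimal sketch of generating BlogPosting markup at build time might look like this. The property names are standard schema.org ones; the helper names, the URL, and the dates are placeholders I made up for the example.

```python
import json


def blog_posting_jsonld(headline, url, author_name, published, modified, description):
    """Build a schema.org BlogPosting object ready to embed in a page."""
    return {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": headline,
        "description": description,
        "author": {"@type": "Person", "name": author_name},
        "datePublished": published,
        "dateModified": modified,
        "mainEntityOfPage": {"@type": "WebPage", "@id": url},
    }


def script_tag(data):
    """Wrap the object in a JSON-LD script element."""
    payload = json.dumps(data, indent=2)
    # Escape "</" so a stray closing tag in the text cannot end the script early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder values: the URL and dates are hypothetical.
post = blog_posting_jsonld(
    headline="SEO in the Age of LLMs and Agentic AI",
    url="https://example.com/notes/seo-llm",
    author_name="JJ",
    published="2026-01-15",
    modified="2026-01-15",
    description="How search, AI answers, training, and agent fetches differ.",
)
tag = script_tag(post)
print(tag)
```

Whatever generates the markup, the values should match what is visibly on the page; markup that disagrees with the visible content works against you.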
Citations are the new ranking signal
Classic SEO measured clicks. AI surfaces measure citations: was your URL listed as one of the supporting sources for the model's answer?
A few practical tools exist now:
- Bing Webmaster Tools has an "AI Performance" report that tracks when your site is cited inside Microsoft Copilot and Bing's AI summaries. As of early 2026 this is the most explicit first-party citation observability surface from any major provider.
- Google Search Console rolls AI Overviews / AI Mode traffic into the regular Performance report under the Web search type. So you do not need a separate dashboard; you do need to remember that "Search" now silently includes "Search-with-AI."
- OpenAI tags referrals from ChatGPT with utm_source=chatgpt.com, which means your analytics tool can isolate ChatGPT-origin traffic from any other source. You just need to ensure you preserve UTM parameters in your funnel.
Anthropic and Brave both surface citations in their AI answers, but neither has published a Bing-style first-party publisher dashboard yet, so for those you are reliant on log analysis and referrer headers.
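Where no dashboard exists, a small amount of log analysis goes a long way. The sketch below assumes you can already pull a user-agent string and a landing URL out of each access-log line; it then labels bot hits by pipeline, using the tokens from the crawler table earlier in this post, and pulls out `utm_source`. The function names and sample strings are illustrative, and real user-agent strings are longer and can be spoofed.

```python
from urllib.parse import parse_qs, urlsplit

# User-agent tokens from the crawler table, mapped to the pipeline they feed.
BOT_ROLES = {
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "ChatGPT-User": "user-fetch",
    "Claude-User": "user-fetch",
}


def bot_role(user_agent):
    """Return 'search', 'training', 'user-fetch', or None for other clients."""
    ua = user_agent.lower()
    for token, role in BOT_ROLES.items():
        if token.lower() in ua:
            return role
    return None


def utm_source(landing_url):
    """Return the utm_source query parameter of a landing URL, if present."""
    values = parse_qs(urlsplit(landing_url).query).get("utm_source")
    return values[0] if values else None


# Simplified sample inputs, not real log data.
sample_ua = "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"
sample_landing = "https://example.com/notes/my-note?utm_source=chatgpt.com"

print(bot_role(sample_ua))         # training
print(utm_source(sample_landing))  # chatgpt.com
```

Counting `bot_role` results per day gives a crude per-pipeline crawl picture, and grouping human visits by `utm_source` separates ChatGPT referrals from everything else.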
IndexNow and freshness
A small but useful protocol: IndexNow. You ping a URL when a page is created, updated, or deleted, and Bing (and a handful of other engines) commit to faster re-discovery. It is one line of curl in a deploy hook:
curl "https://www.bing.com/indexnow?url=https://example.com/notes/my-note&key=YOUR_KEY"
This matters more in an AI world than it used to. AI answers prefer fresh sources, and a stale snapshot of your page citing yesterday's facts is a worse outcome than a slightly slower link click. If you publish frequently, wire IndexNow into your deploy.
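If your deploy hook is a script instead of a shell one-liner, the same ping can be sketched in Python. This version percent-encodes the page URL, which matters once a URL contains its own query string. The function names are mine, `YOUR_KEY` is the same placeholder as in the curl example, and the sketch assumes the key file is already hosted on your site as IndexNow requires.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Same endpoint as the curl example above.
INDEXNOW_ENDPOINT = "https://www.bing.com/indexnow"


def indexnow_ping_url(page_url, key, endpoint=INDEXNOW_ENDPOINT):
    """Build the GET ping URL, percent-encoding the page URL and key."""
    return f"{endpoint}?{urlencode({'url': page_url, 'key': key})}"


def ping(page_url, key):
    """Send the ping and return the HTTP status code.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so a deploy
    hook should decide whether a failed ping is allowed to fail the deploy.
    """
    with urlopen(indexnow_ping_url(page_url, key), timeout=10) as response:
        return response.status


ping_url = indexnow_ping_url("https://example.com/notes/my-note", "YOUR_KEY")
print(ping_url)
# https://www.bing.com/indexnow?url=https%3A%2F%2Fexample.com%2Fnotes%2Fmy-note&key=YOUR_KEY
```

Calling `ping(...)` once per changed URL at the end of a deploy is usually enough for a small site.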
What I would not bother doing
A few things look like they should matter and largely do not:
- "AI-first" schemas you read about on Twitter. If a markup is not documented by the major providers, it is at best inert and at worst a distraction. Stick to schema.org types Google and Bing actually parse.
- Cloaking different content to bots vs. humans. This was already a Google policy violation. The AI era did not change that.
- Hand-tuning meta keywords. Still ignored. Still 1999.
- Worrying about being "too readable to LLMs." Trying to obfuscate prose against models is a losing arms race, and it makes your page worse for actual humans, which hurts the classic ranking layer too.
If you want a page to not be trained on, opt out via the provider's training bot. If you want it not to be cited, opt out via the provider's search bot. If you want it not to exist publicly, use authentication. There is no fourth option that is robust at scale.
The mental model in one sentence
The web used to have one funnel: crawl → index → rank → click. Now it has at least three funnels in parallel (the classic one, an answer-citation one, and a training one), plus on-demand agent fetches that skip the funnel altogether, and the controls have unbundled in step with them. Treat them as separate pipelines, measure them separately, and stop assuming that one robots rule covers all your obligations.
If you found this useful and want to argue with any of it, the contact link in the footer works.