By Saurav | Founder of saavos | Building in public toward $10k MRR
[!TLDR] Generative Engine Optimization (GEO) is the new layer of work that makes your site cite-worthy for ChatGPT browsing, Claude research, Perplexity, and Google AI Overviews. The five things that matter most in 2026: a clean llms.txt at your site root, an AI-crawler-friendly robots.txt, structured data on every meaningful page (Article + FAQPage + BreadcrumbList JSON-LD), short factual sentences agents can quote without rewriting, and explicit citation-friendly headings. Do those five and AI assistants will start referencing you within weeks. Skip them and you stay invisible to the fastest-growing search surface of the decade.
Update 2026-05-18 (post-publish correction): Google's AI Optimization Guide (published 2026-05-15) explicitly states: "You don't need to create new machine readable files, AI text files, markup, or Markdown to appear in generative AI search" and "Structured data isn't required for generative AI search." That directly contradicts items #1 (llms.txt) and #3 (JSON-LD) in the table below, where I called them the highest-leverage GEO moves.
The honest correction: schema and llms.txt still help Googlebot understand your pages, and non-Google LLMs (Perplexity, Claude, ChatGPT) do use llms.txt. Keep them. But per Google's own guide, the actual lever for AI Overview citations is non-commodity content with a unique point of view — something no machine-readable file can manufacture for you. The "must-do" framing in the table has been moderated: items #1 and #3 are useful for page-understanding and non-Google crawlers, not AI Overview entry tickets.
Generative Engine Optimization is the practice of structuring your site so that AI assistants can find you, parse you, quote you accurately, and recommend you to the user who asked the question. It's adjacent to SEO but the optimization targets are different. SEO ranks against a list of blue links. GEO competes for citation slots inside an AI-generated answer — typically 1 to 5 references, with the top one driving the lion's share of click-through.
Three concrete differences:
GEO is not about gaming AI assistants with prompt injections, hidden text, or schema markup that doesn't match content. Anthropic, OpenAI, and Perplexity all explicitly de-rank pages that try this. The work is honest: be quotable, be structured, be cite-worthy.
The five moves below are ranked by how much they move the needle for a small or mid-size site, based on what the GEO research community has converged on as of mid-2026.
| Rank | Move | Effort | Effect | Where it lives |
|---|---|---|---|---|
| 1 | Publish llms.txt at site root | 30 min | Largest single signal — directly tells AI agents what to index | /llms.txt |
| 2 | Allowlist AI crawlers in robots.txt | 5 min | Prevents accidental opt-out via default deny patterns | /robots.txt |
| 3 | Article + FAQPage + BreadcrumbList JSON-LD | 2–4 hrs | Helps assistants extract Q&A and attribute correctly | Each post |
| 4 | Crisp factual sentences with concrete data | Ongoing | Quotability — the actual ranking signal | Body content |
| 5 | Citation-friendly H2s and TL;DR boxes | 1 hr per post | Agents prefer extracting from labeled sections | Page structure |
The first two are one-time and high-leverage. The last three are an editorial habit you build over months.
llms.txt (the largest single signal)
llms.txt is a markdown file at the root of your site that tells AI agents what your most important pages are, in priority order, with descriptions. It was proposed by Jeremy Howard in 2024 and adoption hit critical mass in 2025; by 2026, every major AI assistant checks it first when crawling a domain.
Put it at https://example-domain.com/llms.txt (NOT under /static/ or in a subdirectory). Format:
# Your Site Name
> One-line description of what your site does.
## Product
- [Homepage](https://example-domain.com/): Product overview
- [Pricing](https://example-domain.com/pricing): What it costs
## Blog
- [Most important post](https://example-domain.com/blog/post): One-sentence description
- [Second post](https://example-domain.com/blog/another): One-sentence description
## About
- [About the team](https://example-domain.com/about): Who built this
## Optional
- [RSS Feed](https://example-domain.com/blog/rss.xml): Latest updates
Three rules: descriptions should be a single declarative sentence, the order matters (most important first), and update it whenever you ship a new top-level page. AI assistants do not retry hourly — a stale llms.txt will keep recommending stale pages for weeks.
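One way to keep the file from going stale is to generate it from a page list at build time. The sketch below is only an illustration: `render_llms_txt` and the sample data are hypothetical names, not part of any llms.txt tooling, and the output simply follows the layout shown above.

```python
def render_llms_txt(site_name, summary, sections):
    """Render llms.txt text from an ordered mapping of section -> pages.

    `sections` maps a section title to a list of (title, url, description)
    tuples. Dict order is preserved, so list the most important entries first.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for section, pages in sections.items():
        lines.append(f"## {section}")
        for title, url, description in pages:
            lines.append(f"- [{title}]({url}): {description}")
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical sample data mirroring the example layout.
llms_txt = render_llms_txt(
    "Your Site Name",
    "One-line description of what your site does.",
    {
        "Product": [
            ("Homepage", "https://example-domain.com/", "Product overview"),
            ("Pricing", "https://example-domain.com/pricing", "What it costs"),
        ],
        "Blog": [
            ("Most important post", "https://example-domain.com/blog/post",
             "One-sentence description"),
        ],
    },
)
print(llms_txt)
```

Regenerating the file on every deploy means a new top-level page cannot ship without its llms.txt entry.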
robots.txt
Most sites running CDN security defaults block AI crawlers without realizing it. The default Cloudflare bot-management ruleset, for example, denies GPTBot, ClaudeBot, PerplexityBot, and Google-Extended unless you explicitly allow them.
Add this block to /robots.txt:
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: CCBot
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: cohere-ai
Allow: /
User-agent: anthropic-ai
Allow: /
Whether to allow each is a content-strategy call. If you want AI assistants to cite you, you have to let them in. Disallowing AI crawlers and then wondering why you don't appear in ChatGPT browsing results is the #1 self-inflicted GEO mistake we see.
If your CDN has a separate "AI bot" toggle, flip it to "allow." Verify with curl -A "GPTBot/1.0" https://example-domain.com/llms.txt — you should get a 200, not a 403.
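The curl check catches CDN-level blocks. To check the robots.txt rules themselves, a small sketch using Python's standard `urllib.robotparser` can report which of the crawlers listed above are allowed. The `audit_robots` helper and the sample file are illustrative assumptions; the parser only evaluates robots.txt rules and knows nothing about firewall or bot-management settings.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the robots.txt block in this post.
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
    "CCBot", "Applebot-Extended", "cohere-ai", "anthropic-ai",
]


def audit_robots(robots_txt, url="https://example-domain.com/llms.txt"):
    """Return {crawler: allowed} for `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_CRAWLERS}


# Hypothetical robots.txt: default deny, with a single crawler opted in.
sample = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

for agent, allowed in audit_robots(sample).items():
    print(f"{agent:18} {'allowed' if allowed else 'BLOCKED'}")
```

With the sample file, only GPTBot is reported as allowed; every other crawler falls through to the default deny rule.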
Every meaningful page should emit structured data via JSON-LD <script> tags. AI assistants read these to extract author, date, headline, FAQ pairs, and breadcrumb hierarchy without having to parse the rendered HTML.
Three schemas you must ship:
Article (or BlogPosting) — every blog post and content page:
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "Post title here",
  "datePublished": "2026-05-05",
  "dateModified": "2026-05-05",
  "author": {
    "@type": "Person",
    "name": "Author Name",
    "url": "https://example-domain.com/about/author"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Your Brand",
    "url": "https://example-domain.com"
  },
  "mainEntityOfPage": "https://example-domain.com/blog/post-slug"
}
</script>
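If posts are rendered from templates, the tag can be produced from the same metadata the template already uses, which helps keep it in sync with the page. This is a minimal sketch under that assumption; `blog_posting_jsonld` and its parameters are hypothetical, not an API of any framework.

```python
import json


def blog_posting_jsonld(headline, url, published, modified, author, brand):
    """Return a <script> tag holding BlogPosting JSON-LD for one post.

    `author` and `brand` are (name, url) tuples; dates are ISO 8601 strings.
    """
    data = {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": headline,
        "datePublished": published,
        "dateModified": modified,
        "author": {"@type": "Person", "name": author[0], "url": author[1]},
        "publisher": {"@type": "Organization", "name": brand[0], "url": brand[1]},
        "mainEntityOfPage": url,
    }
    # Escape "</" so a headline containing "</script>" cannot close the tag early.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


tag = blog_posting_jsonld(
    "Post title here",
    "https://example-domain.com/blog/post-slug",
    "2026-05-05",
    "2026-05-05",
    ("Author Name", "https://example-domain.com/about/author"),
    ("Your Brand", "https://example-domain.com"),
)
print(tag)
```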
FAQPage — any post with FAQ-style content:
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How do I do X?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "You do X by..."
      }
    }
  ]
}
</script>
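When the FAQ text already lives in your content source, the schema can be built from the same question-and-answer pairs that render on the page, so the two cannot drift apart. A hedged sketch, where `faq_page_jsonld` is a hypothetical helper name:

```python
import json


def faq_page_jsonld(pairs):
    """Build a FAQPage JSON-LD dict from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


faq = faq_page_jsonld([("How do I do X?", "You do X by...")])
print(json.dumps(faq, indent=2))
```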
BreadcrumbList — every page that's deeper than the homepage:
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    {
      "@type": "ListItem",
      "position": 1,
      "name": "Home",
      "item": "https://example-domain.com/"
    },
    {
      "@type": "ListItem",
      "position": 2,
      "name": "Blog",
      "item": "https://example-domain.com/blog"
    },
    {
      "@type": "ListItem",
      "position": 3,
      "name": "Post Title"
    }
  ]
}
</script>
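Because breadcrumbs usually mirror the URL path, the list can be derived from the page URL plus a display name per level. The sketch below assumes exactly that one-to-one mapping, which will not hold for every site; `breadcrumb_jsonld` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def breadcrumb_jsonld(page_url, names):
    """Build BreadcrumbList JSON-LD from a page URL.

    `names` gives the display name for each level, starting with the
    homepage, so it needs one entry per URL path segment plus one.
    """
    parts = urlsplit(page_url)
    segments = [s for s in parts.path.split("/") if s]
    if len(names) != len(segments) + 1:
        raise ValueError("need one name per breadcrumb level")

    root = f"{parts.scheme}://{parts.netloc}"
    urls = [root + "/"] + [
        root + "/" + "/".join(segments[: i + 1]) for i in range(len(segments))
    ]

    items = []
    for position, (name, url) in enumerate(zip(names, urls), start=1):
        item = {"@type": "ListItem", "position": position, "name": name}
        if position < len(names):
            # The last crumb is the current page; its URL is omitted,
            # matching the hand-written example.
            item["item"] = url
        items.append(item)
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": items,
    }


crumbs = breadcrumb_jsonld(
    "https://example-domain.com/blog/post-slug", ["Home", "Blog", "Post Title"]
)
```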
The single most-cited mistake: structured data that contradicts the visible content. AI assistants cross-check. If your headline claims one thing and the <h1> says another, the page gets de-prioritized.
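A mismatch like this is easy to catch in CI. The following sketch uses only Python's standard library to compare the JSON-LD `headline` with the visible `<h1>`; the class and function names are hypothetical, and exact string equality is an assumption you may want to loosen for your own pages.

```python
import json
from html.parser import HTMLParser


class _PageScan(HTMLParser):
    """Collect the first <h1> text and all JSON-LD script bodies."""

    def __init__(self):
        super().__init__()
        self.h1 = None
        self.jsonld = []
        self._target = None  # "h1", "jsonld", or None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.h1 is None:
            self._target, self._buffer = "h1", []
        elif tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._target, self._buffer = "jsonld", []

    def handle_data(self, data):
        if self._target:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._target == "h1":
            # Collapse whitespace so inline tags and line breaks do not matter.
            self.h1 = " ".join("".join(self._buffer).split())
            self._target = None
        elif tag == "script" and self._target == "jsonld":
            self.jsonld.append("".join(self._buffer))
            self._target = None


def headline_matches_h1(html):
    """Return True if any JSON-LD `headline` equals the visible <h1> text."""
    scan = _PageScan()
    scan.feed(html)
    scan.close()
    headlines = []
    for raw in scan.jsonld:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is skipped, not treated as a match
        if isinstance(data, dict) and "headline" in data:
            headlines.append(data["headline"])
    return scan.h1 is not None and scan.h1 in headlines
```

Running `headline_matches_h1` over rendered HTML for each post and failing the build on `False` keeps the two from drifting.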
This is the editorial habit, and it's where most sites fail.
A quotable sentence is:
The test: open ChatGPT, paste your H2 as a question, see if any of your sentences survive when summarized. If they all collapse into "this site offers chatbots," you have marketing copy, not quotable content.
AI assistants prefer extracting from labeled sections. Two patterns that compound:
TL;DR / summary at the top. A 50-to-100-word summary at the start of every long post. Use a callout box (> [!TLDR] in markdown is a common convention) so the structure is unmistakable. Many AI extraction pipelines preferentially quote from the first labeled summary on a page.
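The 50-to-100-word target is simple to lint. Here is a rough sketch that finds the first `> [!TLDR]` callout in a markdown post and counts its words; the function names are hypothetical, and it assumes the callout is written as consecutive `>`-prefixed lines.

```python
import re


def tldr_word_count(markdown):
    """Return the word count of the first `> [!TLDR]` callout, or None."""
    lines = markdown.splitlines()
    for i, line in enumerate(lines):
        match = re.match(r">\s*\[!TLDR\]\s*(.*)", line, flags=re.IGNORECASE)
        if not match:
            continue
        text = [match.group(1)]
        # The callout continues for as long as lines keep the ">" prefix.
        for follow in lines[i + 1:]:
            if not follow.startswith(">"):
                break
            text.append(follow.lstrip("> ").strip())
        return len(" ".join(text).split())
    return None


def tldr_in_range(markdown, low=50, high=100):
    """Return True if the post has a TL;DR callout of `low` to `high` words."""
    count = tldr_word_count(markdown)
    return count is not None and low <= count <= high
```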
Question-style H2s and H3s. "How does RAG handle updates?" gets cited more often than "Updates and refresh cadence." When users ask AI assistants questions, the agents look for headings that match the user's phrasing. Mirror likely user queries in your headings.
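To audit existing posts for this pattern, a short sketch can list the H2 and H3 headings in a markdown file that do not end in a question mark. The trailing-question-mark test is a deliberately crude assumption and `non_question_headings` is a hypothetical name; the sketch also does not skip headings that appear inside code fences.

```python
import re


def non_question_headings(markdown):
    """Return the H2/H3 headings that are not phrased as a question."""
    flagged = []
    for line in markdown.splitlines():
        # Match "## Heading" or "### Heading", but not "#" or "####".
        match = re.match(r"#{2,3}\s+(.+?)\s*$", line)
        if match and not match.group(1).endswith("?"):
            flagged.append(match.group(1))
    return flagged


doc = """# Post title
## How does RAG handle updates?
### Updates and refresh cadence
#### Too deep to count
"""
print(non_question_headings(doc))
```

A flagged heading is a candidate for rewording, not an automatic failure; some sections (pricing, changelog) have no natural question form.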
Bonus pattern: a comparison table near the top with concrete numbers (price, speed, accuracy) is the single most-quoted element across the assistant ecosystem. Tables get extracted verbatim and cited as authoritative reference data.
A short anti-pattern list, by frequency of wasted effort:
Week 1: Audit. Pull your top 20 traffic pages. For each, check (a) is there structured data? (b) does the H1 match a likely user question? (c) is the TL;DR quotable? (d) does robots.txt allow AI crawlers? Most teams find 60–80% of pages need work.
Week 2: Foundations. Ship llms.txt, fix robots.txt, add BreadcrumbList JSON-LD site-wide. These are one-time and unlock everything else.
Week 3: Per-page upgrades. Add Article + FAQPage JSON-LD to every post and high-intent landing page. Rewrite TL;DRs to be 50–100 words with concrete claims.
Week 4: Editorial. Update your style guide so future content ships with quotable sentences, question-style headings, and tables of concrete data. This is the habit that compounds.
After 30 days, monitor: do you start showing up in ChatGPT browsing? Turn on web search in ChatGPT and ask a question your site should answer. Repeat in Claude (with web search), Perplexity, and Google AI Overviews. By day 60 you should see references; by day 90 they should be consistent for your top topics.
saavos is built around exactly this pattern: every page emits Article + FAQPage + BreadcrumbList structured data, our llms.txt advertises every blog post with a one-sentence description, our robots.txt explicitly opts AI crawlers in, and every post ships with a quotable TL;DR. If you want to see how the pieces fit together, the source is openly inspectable via your browser's view-source.
If you want a structured list to work through before deploying, the AI chatbot evaluation checklist covers the configuration questions that affect how well a bot performs under GEO-optimized content. And if you're still clarifying what kind of tool saavos actually is before deciding whether it fits your stack, what saavos is not draws the lines clearly.
Start free on saavos — paste your URL, get a chatbot that's already optimized for AI citation, no GEO consultant required. Or see our pricing for what each paid tier unlocks.
What is Generative Engine Optimization (GEO)?
GEO is the practice of structuring your site so AI assistants like ChatGPT, Claude, and Perplexity can find, parse, quote, and recommend it. It is adjacent to SEO but optimizes for citation slots inside an AI-generated answer rather than ranking against a list of blue links. SEO ranks pages; GEO ranks sentences. The work overlaps but the editorial habits differ: GEO rewards crisp factual claims, structured data, and quotable summaries far more than keyword density.
What is llms.txt?
llms.txt is a markdown file at the root of your site (https://yourdomain.com/llms.txt, never under a subdirectory) that tells AI agents what your most important pages are, in priority order, with one-sentence descriptions. It was proposed by Jeremy Howard in 2024 and is checked first by every major AI assistant when crawling a domain in 2026. A typical file lists product, blog, and about pages with a one-line summary of each.
Which AI crawlers should I allow in robots.txt?
If you want AI assistants to cite you, allow GPTBot (OpenAI/ChatGPT), ClaudeBot and anthropic-ai (Anthropic), PerplexityBot, Google-Extended (Gemini; AI Overviews are part of Google Search and follow your regular Googlebot rules), Applebot-Extended, CCBot (Common Crawl), and cohere-ai. Many CDN security defaults block these. Verify with curl -A "GPTBot/1.0" against your llms.txt; you should get a 200 response, not a 403.
Do I need JSON-LD structured data?
Strongly recommended in 2026. AI assistants read JSON-LD <script> tags to extract author, date, headline, FAQ pairs, and breadcrumb hierarchy without parsing the rendered HTML. Ship Article (or BlogPosting), FAQPage, and BreadcrumbList schemas on every meaningful page. The biggest mistake is structured data that contradicts the visible content: assistants cross-check, and mismatch is a direct de-rank signal.
How long does GEO take to show results?
For a small to mid-size site that ships the five core GEO moves (llms.txt, AI-crawler robots.txt, JSON-LD structured data, quotable sentences, citation-friendly headings), references typically start appearing within 30–60 days and become consistent for top topics by day 90. The lag is mostly crawl frequency rather than ranking lag: once an assistant has indexed you, citation depends on whether your sentences are quotable.
What are the most common GEO mistakes?
Three patterns we see repeatedly: (1) accidentally blocking AI crawlers via CDN security defaults, so check that GPTBot, ClaudeBot, and PerplexityBot can actually fetch your llms.txt; (2) marketing-heavy copy that does not survive summarization, where vague claims like "the most powerful platform" get filtered out while concrete claims like "answers in under 2 seconds with citations" get quoted; (3) missing or contradictory structured data, which assistants cross-check against visible content.
About the author: Saurav builds tools for solopreneurs and small SaaS teams who don't have an afternoon to spare.