Agent-Ready Website Infrastructure: llms.txt, MCP, robots.txt for AI

Most developer conversations about AI agents focus on building the agent. But agents need somewhere to go. They need to crawl your site, read your docs, call your APIs, and understand what actions are available — and right now, most of the web is not ready for them.

Cloudflare published adoption data in April 2026 that puts this in stark relief: while 78% of sites have a robots.txt, only 4% declare AI usage preferences, and fewer than 15 sites on the entire internet implement MCP Server Cards. The infrastructure side of the agent ecosystem is roughly where SEO was in 2001.

This article is a practical guide for engineers who want to be ahead of that curve.

TL;DR

“Agent readiness” means making your site or API easy for AI agents to discover, read, and use — it is different from building agents.
There are four layers: Discoverability, Content Accessibility, Bot Access Control, and Protocol Discovery.
llms.txt and markdown content negotiation are the two highest-leverage changes you can make today.
Cloudflare’s own docs implementation achieved 31% fewer tokens consumed and 66% faster response times after optimizing for agent consumption.
Only about 4% of sites declare AI usage preferences, and far fewer expose MCP Server Cards. Being early is a real advantage.

What You Will Learn Here

What agent readiness means and why it matters right now
The four dimensions of an agent-ready site
How to implement llms.txt, robots.txt AI rules, and markdown negotiation with real code
How to expose your site’s capabilities through MCP and API catalogs
A practical checklist you can act on today

The Shift Happening Now

The web was built for humans. HTML, CSS, JavaScript — all of it is optimized for a browser rendering engine and a human reading the result. When a search engine crawler visits your site, it reads your content passively: index the text, follow the links, move on.

AI agents work differently. An agent visiting your site wants to:

understand what your site contains and what actions it supports
consume content with minimal noise (no navbars, scripts, ads)
know what it is and is not allowed to do
find API endpoints and capabilities it can call autonomously

Traditional crawler                AI agent
─────────────────                  ─────────────
visit URL                          visit URL
parse HTML                         look for llms.txt
extract text                       request /index.md (markdown)
follow links                       read Content Signals
store index                        find MCP server card
                                   call API or tool
                                   take action

That is a fundamentally different usage pattern. And the web has almost no signals for it yet.

The Four Dimensions of Agent Readiness

Think of agent readiness in four layers, from simplest to most powerful:

Layer 4 ─ Protocol Discovery    ┐
           MCP Server Cards     │  agents can take action
           API Catalogs         │
           Agent Skills         ┘

Layer 3 ─ Bot Access Control    ┐
           Content Signals      │  agents know what they can do
           Web Bot Auth         │
           AI bot rules         ┘

Layer 2 ─ Content Accessibility ┐
           llms.txt             │  agents can read efficiently
           Markdown negotiation │
           Structured reading   ┘

Layer 1 ─ Discoverability       ┐
           robots.txt           │  agents can find content
           sitemap.xml          │
           HTTP Link headers    ┘

Start at Layer 1 and work up. Each layer builds on the one below it, although every step is useful on its own.

Layer 1: Discoverability

robots.txt for AI crawlers

Most robots.txt files were written for Google and Bing. They say nothing about AI agents. The first step is to add explicit rules for the major AI crawlers.

# robots.txt

# Traditional crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI training crawlers (block if you don't want your content used for training)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

# User-triggered agents fetching pages for a live task (usually fine to allow)
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Perplexity-User
Allow: /

# Point crawlers to your sitemap
Sitemap: https://yourdomain.com/sitemap.xml

The distinction matters: training crawlers (GPTBot, CCBot, ClaudeBot) harvest your content for model training. User-triggered agents (ChatGPT-User, Claude-User, Perplexity-User) fetch pages on behalf of a user trying to accomplish something. These are different use cases, and most sites should treat them differently. Vendors add and rename crawlers over time, so check each vendor's published user-agent list before you deploy.
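
If you maintain these rules for several environments, generating the file from a small policy list keeps the allow and disallow decisions in one place. The sketch below is a minimal illustration: `AgentPolicy` and `buildRobotsTxt` are hypothetical names, and the user agents passed in are examples to replace with your own policy.

```typescript
// Hypothetical helper: these names are illustrative, not from any library.
type AgentPolicy = { userAgent: string; allow: boolean };

// Renders one "User-agent" group per policy, followed by the Sitemap line.
function buildRobotsTxt(policies: AgentPolicy[], sitemapUrl: string): string {
  const groups = policies.map(
    (p) => `User-agent: ${p.userAgent}\n${p.allow ? 'Allow' : 'Disallow'}: /`
  );
  return [...groups, `Sitemap: ${sitemapUrl}`].join('\n\n') + '\n';
}

const robotsTxt = buildRobotsTxt(
  [
    { userAgent: 'Googlebot', allow: true },
    { userAgent: 'GPTBot', allow: false }, // training crawler: blocked
    { userAgent: 'CCBot', allow: false }, // training crawler: blocked
  ],
  'https://yourdomain.com/sitemap.xml'
);

console.log(robotsTxt);
```

Writing the result to `public/robots.txt` during the build is one option; serving it from a route is another.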

HTTP Link Headers

For individual pages, you can signal related machine-readable content via HTTP headers:

Link: </llms.txt>; rel="llms"
Link: </index.md>; rel="alternate"; type="text/markdown"

In Next.js, you can set these headers for every route in next.config.ts:

// next.config.ts
export default {
  async headers() {
    return [
      {
        source: '/(.*)',
        headers: [
          {
            key: 'Link',
            value: '</llms.txt>; rel="llms", </index.md>; rel="alternate"; type="text/markdown"',
          },
        ],
      },
    ];
  },
};
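
The config above sends the same `/index.md` alternate for every route. If each page has its own markdown twin, a per-page header value may be more useful. The following is a sketch under that assumption; `buildLinkHeader` is a hypothetical helper, and the `.md` suffix convention is the one this article uses, not something mandated by a standard.

```typescript
// Hypothetical helper: derives the Link header value for a given page path.
function buildLinkHeader(pathname: string): string {
  // Normalize "/docs/api/" to "/docs/api"; the site root maps to "/index".
  const trimmed = pathname.replace(/\/+$/, '');
  const markdownPath = `${trimmed === '' ? '/index' : trimmed}.md`;

  return [
    '</llms.txt>; rel="llms"',
    `<${markdownPath}>; rel="alternate"; type="text/markdown"`,
  ].join(', ');
}

console.log(buildLinkHeader('/docs/api'));
console.log(buildLinkHeader('/'));
```

A middleware or edge function could call this with the request path and set the result as the `Link` response header.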

Layer 2: Content Accessibility

llms.txt — the specification

The llms.txt specification defines a Markdown file at the root of your site that gives AI agents a structured reading list. It is analogous to sitemap.xml but designed for LLM context windows, not search index crawlers.

The file format:

# Your Site or Product Name

> One or two sentence description of what this site is and who it is for.

## Docs

- [Getting Started](https://yourdomain.com/docs/getting-started.md): Installation and first steps
- [API Reference](https://yourdomain.com/docs/api.md): Complete API documentation
- [Configuration](https://yourdomain.com/docs/config.md): All configuration options

## Guides

- [Authentication Guide](https://yourdomain.com/guides/auth.md): How to authenticate users
- [Deployment Guide](https://yourdomain.com/guides/deploy.md): Deploying to production

## Optional

- [Changelog](https://yourdomain.com/changelog.md): Recent changes
- [Roadmap](https://yourdomain.com/roadmap.md): Planned features

Key rules from the spec:

The H1 heading is required (site or product name)
The blockquote summary is optional but strongly recommended
H2 sections organize groups of links
Anything under ## Optional can be skipped by agents in constrained contexts
All linked pages should have markdown equivalents (more on this below)
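
The rules above are simple enough to lint in a build step. The sketch below checks only what this section lists (an H1 first, a recommended blockquote, H2 sections, link-style list items); `checkLlmsTxt` and `LlmsTxtReport` are hypothetical names, and this is not an official validator for the specification.

```typescript
type LlmsTxtReport = { errors: string[]; warnings: string[]; sections: string[] };

// Hypothetical lint for the rules listed above; not an official validator.
function checkLlmsTxt(text: string): LlmsTxtReport {
  const lines = text.split(/\r?\n/);
  const errors: string[] = [];
  const warnings: string[] = [];

  // The first non-empty line must be the H1 with the site or product name.
  const first = lines.find((line) => line.trim() !== '');
  if (first === undefined || !/^# \S/.test(first)) {
    errors.push('first non-empty line must be an H1 such as "# Product Name"');
  }

  // The blockquote summary is optional, so its absence is only a warning.
  if (!lines.some((line) => line.startsWith('> '))) {
    warnings.push('no blockquote summary (optional but strongly recommended)');
  }

  // H2 headings name the link groups, including "Optional".
  const sections = lines
    .filter((line) => /^## \S/.test(line))
    .map((line) => line.slice(3).trim());

  // Each list item should look like "- [title](url): optional note".
  const linkItem = /^- \[[^\]]+\]\([^)\s]+\)(: .+)?$/;
  lines.forEach((line, index) => {
    if (line.startsWith('- ') && !linkItem.test(line.trimEnd())) {
      errors.push(`line ${index + 1}: expected "- [title](url): optional note"`);
    }
  });

  return { errors, warnings, sections };
}

const sampleLlmsTxt = [
  '# Your Site or Product Name',
  '',
  '> One or two sentence description of what this site is.',
  '',
  '## Docs',
  '',
  '- [API Reference](https://yourdomain.com/docs/api.md): Complete API documentation',
  '',
  '## Optional',
  '',
  '- [Changelog](https://yourdomain.com/changelog.md): Recent changes',
].join('\n');

const llmsReport = checkLlmsTxt(sampleLlmsTxt);
console.log(llmsReport);
```

Failing the build on `errors` while only logging `warnings` keeps the optional parts of the format optional.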

Markdown content negotiation

The llms.txt file only works if the pages it links to are actually readable. HTML is noisy — navbars, scripts, ads, and boilerplate can triple the token cost of reading a page.

The pattern is simple: for each page at /docs/api, serve a clean markdown version at /docs/api.md.

In Cloudflare Workers or Pages:

// Cloudflare Worker: serve .md files or HTML based on request path
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

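    // If the agent requests /docs/api.md, serve clean markdown
    if (url.pathname.endsWith('.md')) {
      // Strip only the trailing ".md"; replace('.md', '') would cut the first match anywhere in the path
      const htmlPath = url.pathname.slice(0, -'.md'.length);
      // getMarkdownContent is an app-specific helper (not shown) that loads the markdown source
      const content = await getMarkdownContent(htmlPath, env);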

      return new Response(content, {
        headers: {
          'Content-Type': 'text/markdown; charset=utf-8',
          'Cache-Control': 'public, max-age=3600',
        },
      });
    }

    // Otherwise serve normal HTML
    return env.ASSETS.fetch(request);
  },
};

In Next.js with App Router, you can use a route handler:

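// app/docs/[slug]/route.ts
import { getDoc } from '@/lib/content';

export async function GET(
  _request: Request,
  // Next.js 15+ passes params as a Promise; await it there before reading slug
  { params }: { params: { slug: string } }
) {
  // Only handle requests such as /docs/api.md
  if (!params.slug.endsWith('.md')) {
    // A route.ts cannot share a segment with page.tsx, so HTML pages must live on a different route
    return new Response('Not found', { status: 404 });
  }

  // Strip only the trailing ".md" rather than the first ".md" found in the slug
  const slug = params.slug.slice(0, -'.md'.length);
  const doc = await getDoc(slug);
  if (!doc) {
    return new Response('Not found', { status: 404 });
  }

  return new Response(doc.markdown, {
    headers: { 'Content-Type': 'text/markdown; charset=utf-8' },
  });
}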

Why this matters in numbers: Cloudflare’s implementation of these patterns for their own developer docs achieved 31% fewer tokens consumed per agent visit and 66% faster response times compared to serving HTML. That is a meaningful cost and latency reduction if you are building a docs-heavy product that agents interact with.
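
The examples above pick markdown by URL suffix. Negotiating on the `Accept` request header is a complementary approach: the same URL returns markdown when the client ranks `text/markdown` at least as high as `text/html`. The sketch below is a simplified parser that assumes well-formed headers and ignores wildcards such as `*/*`; `prefersMarkdown` is a hypothetical name.

```typescript
// Hypothetical helper: true when Accept ranks text/markdown at or above text/html.
function prefersMarkdown(accept: string | null): boolean {
  if (!accept) return false;
  const header: string = accept;

  // Returns the highest q-value listed for an exact media type, or 0 if absent.
  const quality = (mediaType: string): number => {
    let best = 0;
    for (const part of header.split(',')) {
      const [type, ...params] = part.split(';').map((s) => s.trim());
      if (type.toLowerCase() !== mediaType) continue;
      const q = params.find((p) => p.startsWith('q='));
      const value = q ? Number(q.slice(2)) : 1; // a missing q means q=1
      if (!Number.isNaN(value)) best = Math.max(best, value);
    }
    return best;
  };

  const markdown = quality('text/markdown');
  return markdown > 0 && markdown >= quality('text/html');
}

console.log(prefersMarkdown('text/markdown, text/html;q=0.8'));
console.log(prefersMarkdown('text/html,application/xhtml+xml,*/*;q=0.8'));
```

If you serve two representations from one URL this way, also send `Vary: Accept` so caches keep the HTML and markdown responses apart.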

Hierarchical llms.txt files

For large sites, a single root-level llms.txt can overwhelm an agent’s context window. The solution is hierarchical files:

/llms.txt                    ← top-level index
/docs/llms.txt               ← docs-specific index
/api/llms.txt                ← API-specific index
/guides/llms.txt             ← guides-specific index

Each subsection file covers only the pages in that directory. An agent exploring your docs section reads /docs/llms.txt without loading all of your marketing pages.
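
The article does not say how an agent should locate the nearest index, so the lookup order below is an assumption: start in the page's own directory and walk up to the root. `llmsTxtCandidates` is a hypothetical helper that only computes the candidate paths; fetching them is left to the caller.

```typescript
// Hypothetical helper: lists llms.txt paths from the most specific directory up to the root.
function llmsTxtCandidates(pathname: string): string[] {
  const segments = pathname.split('/').filter((s) => s !== '');
  // A path without a trailing slash names a page, so drop the page itself.
  if (!pathname.endsWith('/')) segments.pop();

  const candidates: string[] = [];
  for (let depth = segments.length; depth >= 0; depth--) {
    const dir = segments.slice(0, depth).join('/');
    candidates.push(dir === '' ? '/llms.txt' : `/${dir}/llms.txt`);
  }
  return candidates;
}

console.log(llmsTxtCandidates('/docs/api/auth.md'));
```

An agent using this order would read `/docs/llms.txt` before falling back to the top-level index.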

Layer 3: Bot Access Control

Content Signals

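Content Signals is an emerging standard that lets site owners declare AI usage preferences. Cloudflare's Content Signals policy expresses them as a Content-Signal line in robots.txt (for example, ai-train=no). The meta tags below are an illustrative page-level variant rather than part of that policy, so confirm which form the agents you care about actually read.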

<head>
  <!-- Allow AI agents to read content but not use it for training -->
  <meta name="ai-usage" content="no-training, allow-agent-access">

  <!-- Declare licensing terms for AI consumption -->
  <meta name="ai-license" content="CC-BY-4.0">
</head>

As of April 2026, only about 4% of sites implement any form of Content Signals. The standard is still evolving, but adding these meta tags now costs nothing and starts building a machine-readable record of your preferences.

Web Bot Auth

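Web Bot Auth is a newer proposal that lets AI agents prove who they are: the agent signs its requests (the current drafts build on HTTP Message Signatures) and the server verifies the signature before allowing an action. This is relevant when your site has both public content (no auth needed) and sensitive operations (which require a verified agent identity).

Conceptually, the flow looks like this (simplified; the challenge header shown is illustrative):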

Agent visits /api/sensitive-action
Server returns 401 + WWW-Authenticate: BotAuth realm="agent-actions"
Agent presents signed token or OAuth credential
Server verifies and grants access

This is similar to how MCP handles tool authorization — only escalate to auth when the agent crosses into protected territory.
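
The decision logic in that flow can be sketched as a pure function. This mirrors the simplified steps described above, not the wire format of any published Web Bot Auth draft: `authorizeAgent`, `AuthDecision`, and the `verify` callback are hypothetical, and real verification would check a signature or OAuth token rather than compare strings.

```typescript
type AuthDecision = { status: number; headers: Record<string, string> };

// Hypothetical sketch: public paths pass through, protected paths need a verified credential.
function authorizeAgent(
  path: string,
  authorization: string | null,
  verify: (credential: string) => boolean // stand-in for real signature or token checks
): AuthDecision {
  const isProtected = path.startsWith('/api/sensitive-action');
  if (!isProtected) {
    return { status: 200, headers: {} };
  }

  // Challenge the agent when it presents nothing or the credential fails verification.
  if (!authorization || !verify(authorization)) {
    return {
      status: 401,
      headers: { 'WWW-Authenticate': 'BotAuth realm="agent-actions"' },
    };
  }
  return { status: 200, headers: {} };
}

// Toy verifier for the example only; never compare secrets like this in production.
const demoVerify = (credential: string): boolean => credential === 'demo-token';

console.log(authorizeAgent('/docs/api', null, demoVerify));
console.log(authorizeAgent('/api/sensitive-action', null, demoVerify));
console.log(authorizeAgent('/api/sensitive-action', 'demo-token', demoVerify));
```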

Layer 4: Protocol Discovery

This is where your site goes from being readable to being actionable.

llms-full.txt vs llms.txt

Two files are in common use:

llms.txt — a concise reading list for constrained contexts (used by default)
llms-full.txt — the complete content of all pages, concatenated, for agents that want to load everything at once

Generate llms-full.txt as part of your build:

// scripts/generate-llms-full.ts
import fs from 'fs';
import path from 'path';
import { getAllDocs } from './content';

async function generateLlmsFullTxt() {
  const docs = await getAllDocs();

  const content = docs
    .map((doc) => `# ${doc.title}\n\nURL: ${doc.url}\n\n${doc.markdown}`)
    .join('\n\n---\n\n');

  fs.writeFileSync(path.join(process.cwd(), 'public', 'llms-full.txt'), content);
  console.log(`Generated llms-full.txt with ${docs.length} pages`);
}

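// Fail the build with a non-zero exit code if generation throws
generateLlmsFullTxt().catch((error) => {
  console.error(error);
  process.exit(1);
});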

MCP Server Cards

A Model Context Protocol Server Card is a /.well-known/mcp.json file that declares what MCP tools and resources your server exposes. This is the most powerful signal you can add — it tells any MCP-compatible agent exactly what capabilities are available.

// /.well-known/mcp.json
{
  "name": "Your Product MCP Server",
  "version": "1.0.0",
  "description": "Provides tools to interact with Your Product's API",
  "server_url": "https://api.yourdomain.com/mcp",
  "auth": {
    "type": "oauth2",
    "authorization_url": "https://yourdomain.com/oauth/authorize",
    "token_url": "https://yourdomain.com/oauth/token",
    "scopes": ["read", "write"]
  },
  "tools": [
    {
      "name": "search_docs",
      "description": "Search the product documentation",
      "input_schema": {
        "type": "object",
        "properties": {
          "query": { "type": "string", "description": "Search query" },
          "limit": { "type": "integer", "default": 10 }
        },
        "required": ["query"]
      }
    },
    {
      "name": "get_account_info",
      "description": "Get the current user's account information",
      "auth_required": true
    }
  ]
}

An MCP-compatible agent that visits your site can discover this file, understand what tools are available, and use them — without you having to manually register with any agent platform.
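
On the consuming side, an agent that has fetched the file still needs to check its shape before trusting it. The sketch below reads only the fields used in the example card above (`tools`, `name`, `description`, `auth_required`); `summarizeServerCard` and `ToolSummary` are hypothetical names, and the field names follow this article's example rather than a confirmed schema.

```typescript
type ToolSummary = { name: string; description: string; authRequired: boolean };

// Hypothetical reader for the example card format shown above.
function summarizeServerCard(json: string): ToolSummary[] {
  const card: unknown = JSON.parse(json);
  if (typeof card !== 'object' || card === null) {
    throw new Error('server card must be a JSON object');
  }
  const tools = (card as { tools?: unknown }).tools;
  if (!Array.isArray(tools)) {
    throw new Error('server card has no "tools" array');
  }

  return tools.map((tool: unknown, index: number) => {
    if (typeof tool !== 'object' || tool === null) {
      throw new Error(`tool ${index} must be an object`);
    }
    const t = tool as { name?: unknown; description?: unknown; auth_required?: unknown };
    if (typeof t.name !== 'string') {
      throw new Error(`tool ${index} is missing a string "name"`);
    }
    return {
      name: t.name,
      description: typeof t.description === 'string' ? t.description : '',
      authRequired: t.auth_required === true, // absent means no auth declared
    };
  });
}

const exampleCard = JSON.stringify({
  name: 'Your Product MCP Server',
  tools: [
    { name: 'search_docs', description: 'Search the product documentation' },
    { name: 'get_account_info', description: 'Get account information', auth_required: true },
  ],
});

const toolSummaries = summarizeServerCard(exampleCard);
console.log(toolSummaries.filter((t) => !t.authRequired).map((t) => t.name));
```

Separating tools that need auth from those that do not lets an agent start with the public ones and request credentials only when needed.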

Agent Skills Index

For sites that expose multiple capabilities, you can provide an Agent Skills index that catalogs what your site can do in natural language:

// /.well-known/agent-skills.json
{
  "skills": [
    {
      "id": "search",
      "name": "Search Documentation",
      "description": "Search across all product docs and guides",
      "endpoint": "/api/search"
    },
    {
      "id": "purchase",
      "name": "Purchase Products",
      "description": "Browse catalog and complete purchases",
      "endpoint": "/api/commerce",
      "auth_required": true
    }
  ]
}

Putting It All Together: A Site Structure

Here is what a fully agent-ready site looks like:

yourdomain.com/
├── robots.txt               ← AI crawler rules
├── sitemap.xml              ← standard sitemap
├── llms.txt                 ← agent reading list
├── llms-full.txt            ← full concatenated content
├── .well-known/
│   ├── mcp.json             ← MCP server card
│   └── agent-skills.json    ← capability catalog
├── docs/
│   ├── llms.txt             ← docs-specific reading list
│   ├── getting-started.md   ← machine-readable doc
│   └── api.md               ← machine-readable API docs
└── guides/
    ├── llms.txt             ← guides-specific reading list
    └── auth.md              ← machine-readable guide

And the HTTP headers every page should return:

Link: </llms.txt>; rel="llms"
Link: </index.md>; rel="alternate"; type="text/markdown"
X-Robots-Tag: ai-training:noindex

Validation: Use isitagentready.com

Cloudflare built a free scanner at isitagentready.com that audits your site across all four layers and gives you a score with specific recommendations. Run it after implementing each layer to verify your changes are detectable.

The scanner checks:

robots.txt with AI-specific directives
llms.txt presence and validity
Markdown content negotiation
Content Signals meta tags
/.well-known/mcp.json or agent-skills.json

It also exposes itself as an MCP server — so agent-ready tools can scan other sites programmatically.

Implementation Checklist

Work through these in order. Each item is independently valuable.

Day 1 (30 minutes)

Update robots.txt with explicit rules for AI training crawlers vs. task agents
Create /llms.txt with an H1, a blockquote summary, and links to your 5–10 most important pages
Run isitagentready.com to get a baseline score

Week 1

Add markdown versions of your top 10 docs pages (.md URL pattern)
Add HTTP Link headers pointing agents to your llms.txt
Add Content Signals meta tags for AI usage preferences
Create /llms-full.txt in your build pipeline

Month 1

Build out hierarchical llms.txt files per docs section
Serve markdown for all docs/guide pages (not just top 10)
Create /.well-known/mcp.json if you have an API
Add Web Bot Auth for protected endpoints

What to Expect

These standards are early. llms.txt was proposed in late 2024 and is gaining adoption slowly. MCP Server Cards are even newer. You will not see dramatic traffic changes immediately.

What you will see:

AI coding assistants like Claude Code and Cursor will consume your docs more accurately (direct token cost savings for your users)
Products that index the web for agents will prioritize agent-ready sites
As agentic commerce grows, sites with capability declarations will be discoverable by purchasing agents

The Cloudflare data is the clearest signal: the gap between agent-ready and agent-unready sites is already measurable in token counts and latency. A site that is 31% cheaper for agents to read is more likely to get used.

The investment is small. An llms.txt and a few markdown pages take an afternoon. An MCP Server Card takes a day. Being one of the fewer than 15 sites with an MCP Server Card today is early positioning for a web where every user has an agent working on their behalf.

Sources

Cloudflare Blog, Agent Readiness Score: Making Websites AI-Compatible — April 2026. Source for adoption data (78% robots.txt, 4% Content Signals, 3.9% markdown negotiation, <15 MCP Server Card implementations) and Cloudflare docs performance metrics (31% fewer tokens, 66% faster).
llms.txt specification — Official specification for the /llms.txt file format, including file structure, link format, and Optional section semantics.
Cloudflare Developers, Cloudflare Agents — Technical documentation for building agents on Cloudflare Workers and Durable Objects; relevant background on how production agents consume external content.
isitagentready.com — Free agent readiness scanner; reference for the five compliance categories audited (Discoverability, Content Accessibility, Bot Access Control, Protocol Discovery, Commerce).
Model Context Protocol, MCP Specification — Protocol specification for MCP tools, resources, and server cards used in the Protocol Discovery layer.

Luis Mori Guerra
