
RAG Apps in Practice: Structured Knowledge, Better Retrieval, and Real Evals

A practical guide for engineers and PMs designing RAG apps over well-structured, chained knowledge: how to chunk, retrieve, rerank, cite, evaluate, and improve with evidence instead of vibes.

28 min read Updated Jun 15, 2026

TL;DR

  • RAG is not “upload documents and ask questions.” Production RAG is a knowledge system: ingestion, chunking, metadata, retrieval, reranking, answer synthesis, citation, evaluation, monitoring, and repair.
  • The biggest design mistake is treating well-structured knowledge as flat text. Policies, tickets, specs, contracts, manuals, and architecture docs usually need hierarchy, metadata, stable IDs, parent-child chunks, and relationship-aware retrieval.
  • Benchmarks keep showing the same uncomfortable truth: naive RAG helps, but it is not enough. CRAG reports that strong LLMs alone answer at most 34% of its factual questions, straightforward RAG reaches 44%, and strong industry RAG systems reach 63% without hallucination.
  • Evaluate the chain, not only the final answer. Measure retrieval recall, context precision, grounding or faithfulness, answer relevance, refusal behavior, citation quality, latency, cost, and production drift.
  • The best current pattern is hybrid: structured ingestion + semantic and lexical retrieval + metadata filters + reranking + source-aware generation + automated evals + human review on high-risk cases.
  • For connected knowledge, consider hierarchical retrieval, contextual chunks, GraphRAG-style summaries, or knowledge graph retrieval. Use them when the question needs relationships, aggregation, multi-hop reasoning, or corpus-level synthesis.
  • A strong default stack is Postgres for metadata and eval data, Qdrant or pgvector for hybrid retrieval, OpenAI or Voyage embeddings, a dedicated reranker, LlamaIndex or thin custom orchestration, and RAGAS/LangSmith/LangWatch-style evals.
  • If you want a Python-first full-stack implementation, Agno can cover a lot of the app surface: knowledge bases, vector DB integrations, agentic RAG, structured outputs, tools, AgentOS APIs, sessions, traces, RBAC, UI integration, and built-in evals.

What You Will Learn Here

  • How to think about RAG as a chain of decisions, not a single vector search call.
  • What recent RAG benchmarks say about naive retrieval, hallucination, dynamic facts, and multi-hop questions.
  • How to design chunks for structured and chained knowledge.
  • Which evaluation metrics matter for engineers and PMs.
  • A practical reference architecture you can adapt for production apps.
  • A concrete support-policy assistant example with source IDs, retrieval flow, JSON output, and eval checks.
  • A recommended RAG technology stack for MVPs, serious product teams, and enterprise knowledge systems.
  • How to assemble the same kind of full-stack RAG app with Agno and its built-in features.
  • A small Python-style example for checking retrieval and grounded answers.

This article is for engineers and PMs building AI apps over product docs, internal knowledge bases, support histories, policies, specs, contracts, runbooks, customer records, or other business knowledge.

The friendly warning: RAG is one of the easiest AI patterns to demo and one of the easiest to mis-measure.

RAG Is a Chain, Not a Feature

The original RAG idea is simple: retrieve external knowledge, put it into the model context, and generate an answer grounded in that knowledge. That is still the core move.

But in practice, a RAG app is a chain:

User question
     |
     v
Intent + query understanding
     |
     v
Retriever selection + filters
     |
     v
Hybrid search over structured chunks
     |
     v
Reranking + deduplication + context packing
     |
     v
Answer generation with citations
     |
     v
Post-checks: grounding, policy, format, refusal
     |
     v
Telemetry + evals + dataset updates

Every step can fail.

If chunking breaks a table, retrieval may miss the answer. If retrieval returns the right page but the wrong section, the model may guess. If the answer is correct but citation IDs are unstable, nobody can audit it. If evals only score final answer style, the team may never notice that retrieval recall got worse after a corpus update.

So the design question is not:

Which vector database should we use?

It is:

How do we preserve the knowledge structure needed to answer the questions users actually ask?

What Benchmarks Are Telling Us

Benchmarks are not your product. They are still useful because they reveal failure modes that show up again in production.

1. Naive RAG Improves Answers, But Leaves a Big Gap

The CRAG benchmark was designed around factual QA with dynamic facts, long-tail entities, multiple domains, and simulated web and knowledge graph APIs. Its headline result is sobering: strong LLMs alone reach at most 34% accuracy, straightforward RAG improves to 44%, and strong industry RAG systems answer 63% of questions without hallucination.

The takeaway is not “RAG failed.” The takeaway is that simple top-k retrieval is only the first rung. Real questions often require freshness, entity disambiguation, temporal reasoning, multi-source synthesis, and careful rejection when evidence is missing.

2. RAG Needs More Than Answer Accuracy

RGB evaluates RAG systems on four abilities:

Ability | Practical meaning
Noise robustness | Can the model ignore irrelevant retrieved text?
Negative rejection | Can it say “I do not know” when evidence is absent?
Information integration | Can it combine multiple relevant pieces?
Counterfactual robustness | Can it avoid trusting retrieved misinformation?

This maps almost perfectly to production incidents. Many RAG apps do not fail because retrieval returned nothing. They fail because retrieval returned a confusing mix of true, stale, partial, and irrelevant context.

3. Evaluation Needs to Diagnose Components

RAGBench and RAGChecker both push evaluation toward explainability and component-level diagnosis. That matters because “bad answer” is not actionable enough.

A final answer can fail for different reasons:

  • The correct chunk was never indexed.
  • The chunk exists but was not retrieved.
  • The chunk was retrieved but ranked too low.
  • The retrieved context was correct but too noisy.
  • The model ignored the evidence.
  • The model answered correctly but cited the wrong source.
  • The source was stale or superseded.

Those are different bugs. They need different fixes.

The Design Rule: Preserve Meaningful Boundaries

Most internal knowledge is not a smooth stream of paragraphs. It has shape.

Examples:

  • A policy has sections, exceptions, effective dates, owners, and approval status.
  • A support article has symptoms, causes, fixes, affected versions, and related tickets.
  • An architecture decision record has context, decision, consequences, and linked systems.
  • A contract has clauses, definitions, dependencies, jurisdictions, and amendment history.
  • A product spec has goals, non-goals, requirements, flows, analytics, and open questions.

If you flatten all of that into 800-token chunks with overlap, you often destroy the exact structure the model needs.

Better chunks should behave like small knowledge cards:

Chunk ID: policy.refunds.2026.section_4.exception_2
Document: Refund Policy
Version: 2026-04-10
Section path: Refund Policy > Exceptions > Chargebacks
Audience: Support agents
Applies to: US, CA
Status: Approved
Parent: policy.refunds.2026.section_4
Related: policy.payments.chargebacks, macro.support.refund_denied
Text: ...

This kind of metadata lets the system retrieve by both meaning and structure.

A Practical Architecture for Structured RAG

Here is the architecture I trust for most production RAG apps:

                    +-----------------------+
                    | Source systems        |
                    | Docs, CMS, tickets, DB|
                    +-----------+-----------+
                                |
                                v
                    +-----------------------+
                    | Structured ingestion  |
                    | parse, clean, version |
                    +-----------+-----------+
                                |
              +-----------------+-----------------+
              v                                   v
   +-----------------------+           +-----------------------+
   | Retrieval units       |           | Source graph / links  |
   | chunks, parents, refs |           | entities, relations   |
   +-----------+-----------+           +-----------+-----------+
               |                                   |
               v                                   v
   +-----------------------+           +-----------------------+
   | Hybrid indexes        |           | Metadata filters      |
   | vector + BM25         |           | tenant, date, status  |
   +-----------+-----------+           +-----------+-----------+
               +-----------------+-----------------+
                                 v
                    +-----------------------+
                    | Rerank + context pack |
                    | diversify, cite, trim |
                    +-----------+-----------+
                                |
                                v
                    +-----------------------+
                    | Grounded generation   |
                    | answer + citations    |
                    +-----------+-----------+
                                |
                                v
                    +-----------------------+
                    | Evaluation + tracing  |
                    | offline + production  |
                    +-----------------------+

The important part is not the boxes. It is the separation of responsibilities. Retrieval units, source graph, metadata filters, and evaluation traces should be first-class concepts, not incidental strings inside a prompt.

Chunking: Start With the User’s Question, Not Token Size

Chunk size matters, but “what is the best chunk size?” is usually the wrong first question.

Ask these instead:

  1. What questions should this chunk answer by itself?
  2. What parent context does it need?
  3. What metadata must be filterable?
  4. What related chunks should be easy to follow?
  5. What should never be split?

Here is a practical rule of thumb:

Knowledge type | Better chunk boundary
FAQ | One question-answer pair
Policy | Section or exception
API docs | Endpoint, parameter group, error code, example
ADR | Decision, consequence, related system
Legal/contract | Clause, definition, amendment
Support ticket | Symptom, diagnosis, resolution, environment
Tables | Row group with headers preserved
Code docs | Function/class plus signature and examples

For well-structured knowledge, I like parent-child retrieval:

Embed small child chunks for precision
        |
        v
Retrieve child chunks
        |
        v
Expand to parent section for context
        |
        v
Rerank parent + child evidence together

This gives you precision without starving the model of context.
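
A minimal sketch of the expansion step, assuming child hits arrive sorted best-first and that you keep two lookup tables (child to parent, parent to text). The function name `expand_to_parents` and the dict shapes are assumptions made for this example.

```python
def expand_to_parents(child_hits, parent_of, parent_text):
    """Group child hits under their parent section.

    child_hits:  list of (child_id, score), highest score first
    parent_of:   dict mapping child_id -> parent_id
    parent_text: dict mapping parent_id -> full section text
    """
    expanded = {}
    for child_id, score in child_hits:
        # A child without a known parent stands in for itself.
        parent_id = parent_of.get(child_id, child_id)
        entry = expanded.setdefault(
            parent_id,
            {
                "parent_id": parent_id,
                "text": parent_text.get(parent_id, ""),
                "matched_children": [],
                "best_score": score,
            },
        )
        entry["matched_children"].append(child_id)
        entry["best_score"] = max(entry["best_score"], score)
    # Dicts keep insertion order, so parents stay in best-first order.
    return list(expanded.values())
```

Keeping `matched_children` alongside the parent text lets the reranker and the citation step see which small chunk actually triggered the match.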

Anthropic’s contextual retrieval pattern points in a similar direction: enrich each chunk with document-level context before embedding and lexical indexing. Anthropic reports that contextual embeddings combined with contextual BM25 cut failed top-20 retrievals by 49%, and by 67% once a reranker is added. The broader lesson is simple: a chunk should carry enough surrounding meaning to be retrieved correctly outside its original document.
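
Anthropic’s version generates the context with an LLM. The sketch below swaps in a cheaper, deterministic variant that builds the context from metadata you already have. It illustrates the same idea, not their method, and the `contextualize` name and header format are assumptions.

```python
def contextualize(chunk_text: str, doc_title: str, section_path: str, version: str) -> str:
    """Prefix a chunk with a one-line header before embedding and keyword indexing."""
    header = f"[Document: {doc_title} | Section: {section_path} | Version: {version}]"
    return f"{header}\n{chunk_text.strip()}"


indexed_text = contextualize(
    "Refunds are paused while a chargeback is open.",
    doc_title="Refund Policy",
    section_path="Refund Policy > Exceptions > Chargebacks",
    version="2026-04-10",
)
print(indexed_text)
```

Index the contextualized string, but keep the original chunk text separately so the generator and the citations show the source wording unchanged.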

When Knowledge Is Chained, Retrieve the Chain

Some questions are not answered by one chunk.

Example:

Can an enterprise customer in Canada get a refund if they renewed through a reseller and opened a chargeback?

That may require:

  • regional refund policy
  • reseller terms
  • enterprise contract exceptions
  • chargeback policy
  • latest support macro
  • customer account metadata

Flat semantic search may retrieve one or two pieces and miss the chain.

A better strategy is staged retrieval:

Question
   |
   v
Extract entities and constraints
   |
   +--> country = Canada
   +--> customer_type = enterprise
   +--> channel = reseller
   +--> event = renewal
   +--> dispute = chargeback
   |
   v
Retrieve policy candidates with filters
   |
   v
Follow related links and parent sections
   |
   v
Rerank evidence set
   |
   v
Answer only if the chain supports it
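
The "follow related links" step can be sketched as a bounded breadth-first walk over each chunk's related IDs. The `follow_links` name, the `related` adjacency dict, and the hop and size limits are assumptions for illustration.

```python
from collections import deque


def follow_links(seed_ids, related, max_hops=2, limit=20):
    """Expand retrieved seed chunks along related_ids, breadth-first.

    seed_ids: chunk IDs returned by filtered retrieval, best first
    related:  dict mapping chunk_id -> list of related chunk IDs
    """
    ordered = list(dict.fromkeys(seed_ids))  # keep order, drop duplicates
    seen = set(ordered)
    queue = deque((chunk_id, 0) for chunk_id in ordered)
    while queue and len(ordered) < limit:
        chunk_id, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in related.get(chunk_id, []):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            ordered.append(neighbor)
            queue.append((neighbor, hops + 1))
            if len(ordered) >= limit:
                break
    return ordered[:limit]
```

The hop and size limits matter: without them, a densely linked policy corpus turns every question into "retrieve everything", which is the noise problem all over again.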

GraphRAG-style approaches, RAPTOR-style hierarchical retrieval, and newer graph-based systems like LightRAG are all responses to the same limitation: flat chunks are weak at connected, aggregate, or multi-hop questions.

You do not need a knowledge graph for every RAG app. But you should consider one when users ask questions like:

  • “What changed between these versions?”
  • “Which teams are affected by this policy?”
  • “What is the relationship between these incidents?”
  • “Summarize the themes across all customer complaints.”
  • “Find exceptions that depend on region, plan, and contract type.”

Those are relationship questions, not just similarity questions.

Retrieval: Hybrid First, Fancy Later

For many production systems, the highest-leverage retrieval stack is:

metadata filters
   +
BM25 / keyword search
   +
vector search
   +
reranker
   +
context packing

Why hybrid?

Vector search is good at semantic similarity. Keyword search is good at exact terms, product names, IDs, error codes, clause numbers, and weird internal language. Business knowledge has a lot of weird internal language.

Reranking matters because top-k retrieval is usually noisy. A cross-encoder or LLM reranker can inspect the query and candidate chunks together, then reorder the list before the generator sees it.

Context packing matters because a model can lose the answer inside too much irrelevant context. More retrieved tokens can reduce quality if they drown out the evidence.
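
Here is a small packing sketch under simple assumptions: chunks arrive in reranked order, duplicates are detected by normalized text, and tokens are approximated by whitespace-separated words. The `pack_context` name and the default budget are hypothetical; swap in your model's real tokenizer for `count_tokens`.

```python
def pack_context(chunks, max_tokens=3000, count_tokens=lambda text: len(text.split())):
    """Greedily keep the highest-ranked chunks that fit the token budget.

    chunks: list of dicts with "id" and "text", already in reranked order.
    """
    packed, seen_texts, used = [], set(), 0
    for chunk in chunks:
        normalized = " ".join(chunk["text"].split()).lower()
        if normalized in seen_texts:
            continue  # drop exact duplicates, e.g. the same section in two docs
        cost = count_tokens(chunk["text"])
        if used + cost > max_tokens:
            continue  # too big for the remaining budget; a later chunk may fit
        packed.append(chunk)
        seen_texts.add(normalized)
        used += cost
    return packed
```

Log what was dropped as well as what was packed. When an answer fails, "the evidence was retrieved but trimmed" is a different bug from "the evidence was never retrieved."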

Generation: Make the Model Show Its Evidence

The generation prompt should make the evidence contract explicit:

Use only the provided sources.
Answer the user's question directly.
Include citations for every factual claim.
If the sources conflict, name the conflict.
If the sources do not contain the answer, say what is missing.
Do not infer policy from similar but non-applicable sources.

For PMs, this is the product contract: users should know whether the system knows, does not know, or found conflicting evidence.

For engineers, this is the debugging contract: every answer should point back to retrievable source IDs.
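
One way to wire both contracts together is to label every source with its ID in the prompt and then check that the answer cites only those IDs. This sketch assumes citations appear in square brackets like `[policy.a.1]`; that format, and the `build_prompt` and `unknown_citations` helpers, are illustrative choices.

```python
import re

RULES = """Use only the provided sources.
Answer the user's question directly.
Include citations for every factual claim.
If the sources conflict, name the conflict.
If the sources do not contain the answer, say what is missing.
Do not infer policy from similar but non-applicable sources."""


def build_prompt(question: str, sources: list[dict]) -> str:
    """Render the rules, ID-labeled sources, and the question into one prompt."""
    blocks = [f"[{source['id']}]\n{source['text']}" for source in sources]
    return f"{RULES}\n\nSources:\n" + "\n\n".join(blocks) + f"\n\nQuestion: {question}"


def unknown_citations(answer: str, sources: list[dict]) -> set[str]:
    """Return cited IDs that were never provided to the model."""
    cited = set(re.findall(r"\[([\w.\-]+)\]", answer))
    return cited - {source["id"] for source in sources}
```

A non-empty result from `unknown_citations` is a cheap, deterministic post-check: the model cited something it was never shown, so the answer should be blocked or flagged before a user sees it.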

Practical Example: A Support Policy Assistant

Let’s make this less abstract.

Imagine a SaaS company wants an internal assistant for support agents. The assistant answers policy questions before an agent replies to a customer.

The knowledge sources are:

  • refund policy pages
  • reseller terms
  • enterprise contract templates
  • support macros
  • incident notes
  • product plan metadata

The hardest question is not “What is the refund policy?” It is a chained policy question:

Customer ACME is on Enterprise, renewed through a reseller, is in Canada, and opened a chargeback. Can support issue a refund?

A practical RAG flow would look like this:

User question
   |
   v
Extract constraints
   - customer = ACME
   - plan = Enterprise
   - region = Canada
   - sales_channel = reseller
   - event = renewal
   - dispute = chargeback
   |
   v
Fetch account metadata from app DB
   |
   v
Retrieve policy chunks with metadata filters
   - status = approved
   - effective_date <= today
   - region in [Canada, Global]
   - audience = support
   |
   v
Hybrid search
   - vector: "reseller renewal refund chargeback enterprise"
   - keyword: reseller, chargeback, renewal, Enterprise
   |
   v
Rerank top 40 candidates down to top 8
   |
   v
Expand child chunks to parent sections
   |
   v
Generate answer with citations and a decision state

The assistant should not return a vague answer like:

It depends. Please review the reseller and chargeback policy.

A better answer is structured:

{
  "decision": "do_not_refund_without_escalation",
  "confidence": "medium",
  "answer": "Support should not issue the refund directly. The customer renewed through a reseller, and the chargeback policy requires finance escalation before refund approval.",
  "required_escalation": "finance",
  "citations": [
    "policy.refunds.2026.reseller_terms.section_2",
    "policy.payments.2026.chargebacks.section_4",
    "macro.support.refund_escalation.2026"
  ],
  "missing_information": [
    "The current reseller agreement ID was not available in retrieved sources."
  ]
}

Notice what this does for the product:

  • The support agent gets a direct recommendation.
  • The answer is grounded in stable source IDs.
  • The system exposes missing information instead of hiding uncertainty.
  • The decision can be audited later.
  • The response can be evaluated automatically.
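
Because the answer is structured, a deterministic check can run before it reaches the agent. A minimal sketch, assuming the field names from the JSON above; `validate_decision` and its messages are hypothetical.

```python
import json

REQUIRED_KEYS = {"decision", "confidence", "answer", "citations", "missing_information"}
ALLOWED_CONFIDENCE = {"low", "medium", "high"}


def validate_decision(raw_json: str, retrieved_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the payload is usable."""
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["payload is not a JSON object"]

    problems = [f"missing field: {key}" for key in sorted(REQUIRED_KEYS - payload.keys())]
    if payload.get("confidence") not in ALLOWED_CONFIDENCE:
        problems.append("confidence must be low, medium, or high")
    citations = payload.get("citations") or []
    if not citations:
        problems.append("no citations")
    # Every cited ID must come from the evidence that was actually retrieved.
    unknown = [c for c in citations if c not in retrieved_ids]
    if unknown:
        problems.append(f"citations not in retrieved evidence: {unknown}")
    return problems
```

This catches the cheapest class of failures (malformed output, invented source IDs) without an LLM judge, which keeps the expensive evals focused on grounding and decision quality.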

The ingestion model behind this can stay simple:

chunk = {
    "id": "policy.payments.2026.chargebacks.section_4",
    "text": "Chargeback-related refunds require finance review before approval...",
    "parent_id": "policy.payments.2026.chargebacks",
    "section_path": "Payments > Chargebacks > Refund approval",
    "source_url": "https://internal.example.com/policies/payments#chargebacks",
    "status": "approved",
    "effective_date": "2026-04-01",
    "owner": "finance-ops",
    "audience": "support",
    "regions": ["global"],
    "entities": ["refund", "chargeback", "finance review"],
    "related_ids": [
        "policy.refunds.2026.reseller_terms.section_2",
        "macro.support.refund_escalation.2026"
    ],
}

And your retrieval function should return evidence, not just text:

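from datetime import date


def retrieve_policy_evidence(question: str, account: dict) -> list[dict]:
    # vector_search, keyword_search, reciprocal_rank_fusion, rerank, and
    # expand_with_parent_sections are app-specific helpers defined elsewhere.
    filters = {
        "status": "approved",
        "audience": "support",
        "regions": ["global", account["country"]],
        # Only policies already in effect today, rather than a hardcoded date.
        "effective_date_lte": date.today().isoformat(),
    }

    # Over-fetch from both retrievers, fuse, then rerank down to the final 8.
    dense_hits = vector_search(question, filters=filters, top_k=30)
    keyword_hits = keyword_search(question, filters=filters, top_k=30)
    fused_hits = reciprocal_rank_fusion([dense_hits, keyword_hits], limit=40)
    reranked = rerank(question, fused_hits, top_k=8)

    return expand_with_parent_sections(reranked)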
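
The `reciprocal_rank_fusion` helper is left undefined above. One possible sketch, assuming each hit is a dict with an `"id"` key and using the commonly cited smoothing constant k=60:

```python
def reciprocal_rank_fusion(ranked_lists, limit=40, k=60):
    """Fuse several ranked hit lists into one list.

    A hit at 1-based rank r in a list contributes 1 / (k + r) to its fused
    score, so items that rank well in more than one list rise to the top.
    """
    scores, hits_by_id = {}, {}
    for hits in ranked_lists:
        for rank, hit in enumerate(hits, start=1):
            scores[hit["id"]] = scores.get(hit["id"], 0.0) + 1.0 / (k + rank)
            hits_by_id.setdefault(hit["id"], hit)  # keep the first copy seen
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [hits_by_id[hit_id] for hit_id in ordered[:limit]]
```

Rank-based fusion is convenient here because dense similarity scores and BM25 scores live on different scales; using ranks avoids having to normalize them against each other.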

The eval for this one example should check:

Check | Expected
Retrieval recall | Reseller terms and chargeback policy are in top 8
Citation coverage | Final answer cites both required policy sections
Decision correctness | decision = do_not_refund_without_escalation
Missing information | Mentions reseller agreement ID if unavailable
Grounding | No claim appears without source support
Product behavior | Escalates to finance instead of inventing an approval

That is the practical loop: design the source IDs, retrieve the evidence chain, force the answer into an auditable shape, and evaluate each step.
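
Four of those six checks are deterministic and can be sketched without a judge model. The function name `check_support_example` is hypothetical; the source IDs and decision value come from the example above. Grounding and missing-information checks are left out because they usually need an LLM judge or human labels.

```python
REQUIRED_SOURCES = {
    "policy.refunds.2026.reseller_terms.section_2",
    "policy.payments.2026.chargebacks.section_4",
}


def check_support_example(retrieved_ids: list[str], answer: dict) -> dict[str, bool]:
    """Run the deterministic checks for the ACME reseller chargeback case."""
    top_8 = set(retrieved_ids[:8])
    return {
        "retrieval_recall": REQUIRED_SOURCES <= top_8,
        "citation_coverage": REQUIRED_SOURCES <= set(answer.get("citations", [])),
        "decision_correct": answer.get("decision") == "do_not_refund_without_escalation",
        "escalates_to_finance": answer.get("required_escalation") == "finance",
    }
```

Reporting each check separately, rather than a single pass or fail, tells you whether a regression came from retrieval, citation, or the decision itself.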

There is no universal best RAG stack, but there are sensible defaults. I would choose the stack based on team size, data shape, and how much operational control you need.

Default Stack for Most Product Teams

Layer | Recommendation | Why
App/API | FastAPI, Node.js, or your existing backend | Keep RAG close to product auth, tenancy, and logs.
Ingestion | LlamaIndex or custom pipeline with unstructured parsers | Good abstractions for parsing, metadata, nodes, and retrieval patterns.
Storage | Postgres for source metadata and eval datasets | Your product already trusts it; great for IDs, versions, permissions, and audit trails.
Retrieval index | Qdrant, pgvector, Weaviate, Pinecone, or OpenSearch | Pick based on scale and team ops preferences; require metadata filters.
Hybrid search | Dense vectors + BM25/sparse vectors | Better for product names, IDs, clause numbers, and natural-language questions.
Embeddings | OpenAI text-embedding-3-small for cost, text-embedding-3-large for quality-sensitive search | Official docs list 1536 dimensions for small and 3072 for large, with dimension reduction available.
Reranking | Cohere Rerank, Voyage rerankers, bge-reranker, or provider-native rerank | Rerankers usually improve precision before the LLM sees context.
Generation | GPT-4.1/GPT-5 class model, Claude Sonnet class model, or your approved enterprise model | Choose by eval results, not brand preference.
Evaluation | RAGAS or DeepEval locally; LangSmith, LangWatch, TruLens, or Arize Phoenix for managed tracing | You need retrieval and generation metrics, plus trace debugging.
Observability | OpenTelemetry + product analytics + eval dashboards | Connect model quality to user outcomes, latency, and cost.

My default implementation choice for a serious but not enormous product would be:

Backend:        FastAPI or existing Node.js service
Metadata DB:    Postgres
Vector search:  Qdrant with dense + sparse hybrid search
Embeddings:     text-embedding-3-large for quality, small for cost-sensitive paths
Reranker:       Cohere Rerank or Voyage reranker
Orchestration:  LlamaIndex for ingestion/retrieval, thin custom app logic
Evaluation:     RAGAS + LangSmith/LangWatch traces
Monitoring:     OpenTelemetry + dashboards for latency, cost, retrieval recall, faithfulness

Why this stack?

  • Qdrant has first-class hybrid queries with dense, sparse, and multivector retrieval plus score fusion.
  • pgvector is excellent when you want fewer moving parts and already run Postgres; combine it with Postgres full-text search and reciprocal rank fusion for hybrid search.
  • LlamaIndex is useful for structured ingestion, metadata-aware retrieval, and production RAG patterns.
  • LangSmith, LangWatch, TruLens, RAGAS, and DeepEval all help separate retrieval quality from answer quality.
  • Cohere’s rerank API and similar rerankers are designed to reorder candidate documents by relevance before generation.

Lightweight Stack for an MVP

Use this when you need to learn fast:

Postgres + pgvector
Postgres full-text search
OpenAI embeddings
One reranker
RAGAS or DeepEval test set
Simple trace table in Postgres

This is boring in the best way. It keeps the system understandable. The main tradeoff is that you will build more retrieval glue yourself.

Heavier Stack for Enterprise Knowledge

Use this when knowledge is large, permissioned, cross-linked, and audited:

Document pipeline: LlamaParse, Unstructured, or custom parsers
Metadata system: Postgres
Search: OpenSearch/Elasticsearch + vector DB, or Qdrant/Weaviate with hybrid search
Graph layer: Neo4j, Kuzu, or graph tables in Postgres
Reranking: dedicated reranker
LLM gateway: LiteLLM or internal provider gateway
Eval/observability: LangSmith, LangWatch, Arize Phoenix, TruLens, OpenTelemetry
Human review: annotation queue for failed and high-risk answers

This stack is heavier because enterprise RAG is rarely just search. It is permissions, freshness, ownership, auditability, and incident response wearing a search costume.

Full-Stack RAG with Agno Built-In Features

If your team is Python-first and wants to ship a RAG product without assembling every runtime piece by hand, Agno is a strong candidate.

As of June 15, 2026, the current Agno docs describe three useful layers:

Layer | Agno feature | What it gives you
Knowledge | Knowledge, vector DB integrations, content DB | Documents, metadata, embeddings, and searchable knowledge.
Agent runtime | Agent, tools, structured output, teams, workflows | The reasoning and action layer that answers or delegates.
Product runtime | AgentOS, Agent API, AG-UI, AgentUI, tracing, evals, RBAC | The API, UI, sessions, traces, auth, and operating surface.

Agno’s important design choice is that knowledge can be attached directly to an agent. With knowledge attached, Agno uses Agentic RAG by default: the agent can decide when to search the knowledge base. If you want classic RAG, where relevant references are always added to context based on the user message, Agno exposes that mode too.

Here is how I would implement the support-policy assistant from the previous section with Agno.

1. Store Knowledge in Postgres + PgVector

For a serious product, start with Postgres because it can hold source metadata, sessions, traces, eval datasets, and vectors through PgVector.

from agno.db.postgres.postgres import PostgresDb
from agno.knowledge.embedder.openai import OpenAIEmbedder
from agno.knowledge.knowledge import Knowledge
from agno.vectordb.pgvector import PgVector, SearchType

db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

db = PostgresDb(db_url=db_url)

policy_knowledge = Knowledge(
    name="support_policy_knowledge",
    description="Approved support policies, reseller terms, chargeback rules, and support macros.",
    contents_db=db,
    vector_db=PgVector(
        table_name="support_policy_vectors",
        db_url=db_url,
        embedder=OpenAIEmbedder(),
        search_type=SearchType.hybrid,
    ),
)

The important pieces:

  • contents_db gives AgentOS a place to manage uploaded content and metadata.
  • PgVector stores embeddings and supports production-friendly Postgres operations.
  • SearchType.hybrid keeps exact policy terms, IDs, and semantic meaning in the same retrieval path.
  • You can swap PgVector for Qdrant, Weaviate, Pinecone, LanceDB, Chroma, Redis, or other supported vector stores without rewriting the agent logic.

2. Insert Documents with Metadata

Agno can ingest content into the knowledge base, but you still need to treat metadata as product design.

await policy_knowledge.ainsert(
    name="Refund Policy 2026",
    url="https://internal.example.com/policies/refunds-2026.pdf",
    metadata={
        "status": "approved",
        "audience": "support",
        "owner": "finance-ops",
        "effective_date": "2026-04-01",
        "regions": ["global", "CA", "US"],
        "source_type": "policy",
    },
)

Do not stop at “upload a PDF.” Add the metadata your retrieval policy needs: status, owner, region, product, plan, effective date, source type, and permission scope.
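
A small pre-ingestion check can enforce that rule. This is plain Python that runs before any insert call and does not use Agno's API; the `metadata_problems` name and the required-key set are assumptions based on the metadata shown above.

```python
from datetime import date

REQUIRED_METADATA = {"status", "audience", "owner", "effective_date", "regions", "source_type"}


def metadata_problems(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the document may be ingested."""
    problems = [f"missing: {key}" for key in sorted(REQUIRED_METADATA - metadata.keys())]
    if "effective_date" in metadata:
        try:
            date.fromisoformat(metadata["effective_date"])
        except (TypeError, ValueError):
            problems.append("effective_date is not an ISO date")
    if "regions" in metadata and not metadata["regions"]:
        problems.append("regions is empty")
    return problems
```

Rejecting a document at ingestion is much cheaper than discovering months later that a whole policy family was never filterable by region.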

3. Add Product Tools Around the Knowledge Base

Knowledge answers policy questions. Tools answer product-state questions.

from agno.tools import tool


@tool
def get_customer_account(customer_name: str) -> dict:
    """Fetch account facts needed for policy decisions."""
    return {
        "customer": customer_name,
        "plan": "Enterprise",
        "country": "CA",
        "sales_channel": "reseller",
        "renewal_date": "2026-05-12",
        "chargeback_open": True,
    }

This is the correct split:

Knowledge base -> "What do the policies say?"
Product tools  -> "What is true about this customer right now?"
Agent          -> "Given both, what decision is supported?"

4. Force a Structured Decision Output

For product workflows, do not return loose prose. Use structured output so the frontend, audit log, and evals can rely on stable fields.

from pydantic import BaseModel, Field


class PolicyDecision(BaseModel):
    decision: str = Field(description="refund_allowed, deny, or escalate")
    confidence: str = Field(description="low, medium, or high")
    answer: str
    required_escalation: str | None = None
    citations: list[str]
    missing_information: list[str] = []

Agno supports Pydantic-backed structured outputs, so this becomes an API contract rather than a parsing exercise.

5. Build the RAG Agent

from agno.agent import Agent
from agno.models.openai import OpenAIResponses

policy_agent = Agent(
    name="support_policy_agent",
    model=OpenAIResponses(id="gpt-5.2"),
    knowledge=policy_knowledge,
    tools=[get_customer_account],
    output_schema=PolicyDecision,
    instructions=[
        "Answer only from approved policy knowledge and account tool data.",
        "Search the knowledge base before making policy claims.",
        "Cite source IDs for every policy claim.",
        "If required sources are missing or conflict, escalate instead of guessing.",
        "Never approve a refund when chargeback rules require finance review.",
    ],
)

This already gives you the core app behavior:

  • knowledge search
  • tool use
  • grounded answer generation
  • typed output
  • traceable source usage

6. Serve It with AgentOS

AgentOS turns the agent into a production API. The current Agno docs describe REST and SSE as default API surfaces, with options for AG-UI, WebSocket, MCP, A2A, Slack, and other interfaces.

from agno.os import AgentOS
from agno.os.interfaces.agui import AGUI

agent_os = AgentOS(
    name="Support Policy OS",
    agents=[policy_agent],
    interfaces=[AGUI(agent=policy_agent)],
    tracing=True,
)

app = agent_os.get_app()

if __name__ == "__main__":
    agent_os.serve(app="support_policy_os:app", reload=True)

That gives you:

  • FastAPI runtime
  • REST endpoints
  • SSE-compatible streaming
  • session APIs
  • run APIs
  • trace capture
  • AgentOS UI connectivity
  • AG-UI-compatible frontend integration

In other words, your frontend can be an Astro/React/Next.js app, but the RAG runtime does not need to be hand-rolled.

7. Use Built-In Evals Before Release

Agno has built-in eval dimensions for accuracy, agent-as-judge criteria, performance, and reliability. For this RAG assistant, I would run at least two eval types:

Accuracy eval:
  Input: "ACME renewed through reseller in Canada and has an open chargeback. Can support refund?"
  Expected: decision = escalate, required_escalation = finance

Reliability eval:
  Expected tool calls:
    - get_customer_account
    - search_knowledge_base

The release gate should not be “the answer looked good once.” It should be:

Golden set passes
   +
required tool calls happen
   +
citations are present
   +
structured output validates
   +
trace shows the right knowledge was retrieved
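
That gate can be expressed as a small function over per-example results. The result field names and `release_gate` itself are hypothetical; they stand in for whatever your eval runner records, and this sketch does not use Agno's eval API.

```python
CHECKS = (
    "answer_correct",
    "required_tools_called",
    "citations_present",
    "output_valid",
    "required_sources_retrieved",
)


def release_gate(results: list[dict], min_pass_rate: float = 1.0) -> dict:
    """Pass only if enough golden examples clear every check.

    results: one dict per golden example, with an "id" and a boolean per check.
    """
    failures = []
    for result in results:
        failed = [check for check in CHECKS if not result.get(check, False)]
        if failed:
            failures.append((result["id"], failed))
    pass_rate = 1 - len(failures) / len(results) if results else 0.0
    return {
        "passed": pass_rate >= min_pass_rate,
        "pass_rate": pass_rate,
        "failures": failures,
    }
```

An empty result list fails the gate on purpose: "no evals ran" should never look like "all evals passed."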

8. Use AgentOS for the Operating Loop

Once deployed, use Agno’s database-backed runtime features as your improvement loop:

Production run
   |
   v
AgentOS stores session + trace + tool calls
   |
   v
Review bad or low-confidence answers
   |
   v
Promote failures into eval examples
   |
   v
Improve metadata, policies, tools, prompt, or model
   |
   v
Rerun evals before release
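
The "promote failures into eval examples" step is worth automating early. A minimal sketch, assuming a reviewed trace is available as a dict with `trace_id`, `question`, and `decision` keys; those keys and the `promote_to_eval_example` name are hypothetical and not an AgentOS schema.

```python
import json


def promote_to_eval_example(trace: dict, corrected_decision: str,
                            required_source_ids: list[str]) -> str:
    """Turn a reviewed production trace into one JSONL line for the golden set."""
    example = {
        "question": trace["question"],
        "expected_decision": corrected_decision,
        "required_source_ids": sorted(required_source_ids),
        # Keep a pointer back to the original run for later audits.
        "origin": {
            "trace_id": trace["trace_id"],
            "observed_decision": trace["decision"],
        },
    }
    return json.dumps(example, sort_keys=True)
```

Storing the observed decision next to the corrected one makes it easy to confirm later that a fix actually changed the behavior that was reported.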

This is where Agno’s “built-in” features matter most. AgentOS is not only a way to expose an agent as an API. It gives you the runtime records needed to debug, evaluate, and continuously improve the RAG app.

When Agno Is the Right Fit

Use Agno for RAG when:

  • your backend team is comfortable with Python
  • you want Agentic RAG without writing the search tool contract from scratch
  • you want sessions, traces, memory, knowledge, evals, and APIs in one runtime
  • you need to expose the same agent through REST, streaming, AG-UI, or other interfaces
  • you want data ownership, with sessions, knowledge, and traces in your database

Be more cautious when:

  • your company already has a mature TypeScript-only AI runtime
  • you need very custom retrieval orchestration that fights the framework
  • your main challenge is search infrastructure, not agent runtime

My practical recommendation: use Agno to own the agent/runtime layer, but keep the knowledge model explicit. Stable IDs, metadata, retrieval tests, and source ownership are still your responsibility.

Evaluation: Measure the Whole RAG Chain

Good RAG evaluation has at least four layers.

Layer | Question | Example metric
Retrieval | Did we retrieve the right evidence? | recall@k, context precision, MRR
Context | Did we pass useful, non-noisy context? | context relevance, redundancy, freshness
Generation | Did the answer use evidence correctly? | faithfulness, groundedness, answer relevance
Product | Did this satisfy the user safely? | task success, escalation rate, CSAT, human review

RAGAS, TruLens, DeepEval, LangSmith, and similar tools all circle around the same core idea: evaluate retrieval and generation separately, then evaluate the end-to-end experience.

Here is a small evaluator sketch:

from dataclasses import dataclass


@dataclass
class RagExample:
    question: str
    required_source_ids: set[str]
    expected_answer_facts: set[str]


@dataclass
class RagRun:
    retrieved_source_ids: list[str]
    answer: str
    cited_source_ids: set[str]


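def recall_at_k(example: RagExample, run: RagRun, k: int = 5) -> float:
    """Fraction of required sources that appear in the top-k retrieved results."""
    retrieved = set(run.retrieved_source_ids[:k])
    if not example.required_source_ids:
        # No required sources (for example, a missing-answer case): trivially satisfied.
        return 1.0
    return len(example.required_source_ids & retrieved) / len(example.required_source_ids)


def citation_coverage(example: RagExample, run: RagRun) -> float:
    """Fraction of required sources that the answer actually cites."""
    if not example.required_source_ids:
        return 1.0
    return len(example.required_source_ids & run.cited_source_ids) / len(example.required_source_ids)


def unsupported_fact_count(answer_claims: set[str], supported_claims: set[str]) -> int:
    """Number of claims in the answer that no retrieved evidence supports."""
    return len(answer_claims - supported_claims)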

This is intentionally simple. In a real system, answer_claims and supported_claims might come from an LLM judge, a claim extractor, or human labels. The point is that you should not only ask, “Was the answer nice?” Ask whether the required sources were retrieved, cited, and used.
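
The retrieval layer also lists MRR and context precision, which the sketch above does not cover. A minimal companion, assuming the same string source IDs: the function names here are illustrative, and `precision_at_k` is a plain unweighted stand-in, not the rank-weighted context precision that some eval libraries compute.

```python
def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, source_id in enumerate(retrieved_ids, start=1):
        if source_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ids, relevant) for ids, relevant in runs) / len(runs)


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are relevant (unweighted)."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for source_id in top_k if source_id in relevant_ids) / len(top_k)
```

Recall tells you whether the evidence was found at all. Reciprocal rank and precision tell you how much noise the model has to read past to reach it, which is why they are worth tracking separately.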

The Eval Set PMs and Engineers Can Share

A useful RAG eval set should include more than happy-path questions.

Case type | Why it matters
Known answer | Basic correctness
Missing answer | Refusal and uncertainty
Conflicting sources | Conflict handling
Stale source | Freshness and versioning
Multi-hop question | Chained evidence
Entity ambiguity | Disambiguation
Exact identifier | Keyword and metadata retrieval
Table-heavy answer | Parser and chunk quality
Policy exception | Boundary logic
User typo or vague query | Query rewriting

A practical starting set is 50 to 100 examples. One possible split, totaling 90:

  • 30 common user questions
  • 20 edge cases from production logs or support tickets
  • 20 adversarial or stale-source cases
  • 10 missing-answer cases
  • 10 multi-hop or relationship questions

Then add failures from production every week.
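
One way to keep that mix honest as the set grows is to tag every example with its case type and check coverage against a target. The sketch below is illustrative only: `CaseType`, `EvalCase`, `coverage_gaps`, and the `origin` field are hypothetical names, and the enum lists only a few of the case types from the table.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class CaseType(str, Enum):
    # A subset of the case types in the table above; extend as needed.
    KNOWN_ANSWER = "known_answer"
    MISSING_ANSWER = "missing_answer"
    CONFLICTING_SOURCES = "conflicting_sources"
    STALE_SOURCE = "stale_source"
    MULTI_HOP = "multi_hop"


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    question: str
    case_type: CaseType
    origin: str = "curated"  # assumed labels: "curated" or "production_failure"


def coverage_gaps(cases: list[EvalCase], minimums: dict[CaseType, int]) -> dict[CaseType, int]:
    """Return how many more examples each under-represented case type needs."""
    counts = Counter(case.case_type for case in cases)
    return {
        case_type: needed - counts[case_type]
        for case_type, needed in minimums.items()
        if counts[case_type] < needed
    }
```

Running a check like this in CI makes it visible when weekly production failures pile up in one category while missing-answer or multi-hop cases stay thin.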

How to Diagnose Failures

When a RAG answer fails, do not start by changing the prompt. Use the trace.

Bad answer
   |
   v
Was the needed source indexed?
   | no -> ingestion/parser/versioning bug
   v yes
Was it retrieved in top-k?
   | no -> query, embeddings, filters, chunking, hybrid search
   v yes
Was it ranked high enough?
   | no -> reranker, scoring, metadata boost
   v yes
Was it included in final context?
   | no -> context packing, dedupe, token budget
   v yes
Did the model use it correctly?
   | no -> prompt, model, citation rules, post-check
   v yes
Is the expected label wrong?
       -> update dataset

This is where PMs can help a lot. Many failures are not model failures. They are knowledge ownership failures: stale docs, unclear policy, missing source of truth, or contradictory pages.
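
The decision tree above can also be sketched as a small triage function that reports the first stage that failed. This assumes your trace can be reduced to one boolean per stage; `FailureTrace` and `diagnose` are hypothetical names, not part of any framework.

```python
from dataclasses import dataclass


@dataclass
class FailureTrace:
    # One flag per question in the decision tree, filled in from the trace.
    source_indexed: bool
    retrieved_in_top_k: bool
    ranked_high_enough: bool
    in_final_context: bool
    used_correctly: bool


def diagnose(trace: FailureTrace) -> str:
    """Return where to look first, following the stages in pipeline order."""
    checks = [
        (trace.source_indexed, "ingestion/parser/versioning bug"),
        (trace.retrieved_in_top_k, "query, embeddings, filters, chunking, hybrid search"),
        (trace.ranked_high_enough, "reranker, scoring, metadata boost"),
        (trace.in_final_context, "context packing, dedupe, token budget"),
        (trace.used_correctly, "prompt, model, citation rules, post-check"),
    ]
    for passed, where_to_look in checks:
        if not passed:
            return where_to_look
    # Every stage passed, so the expected label itself is the suspect.
    return "expected label may be wrong: review and update dataset"
```

The checks are ordered on purpose. A later stage cannot succeed if an earlier one failed, so the first failing flag is the one worth fixing before anything downstream.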

Best Practices I Would Use in 2026

Here is the short version I would put into a team checklist.

  1. Design the corpus before tuning the model.
  2. Keep source IDs stable and citation-friendly.
  3. Preserve document hierarchy and section paths.
  4. Add metadata for tenant, product, region, version, status, owner, and effective date.
  5. Use hybrid retrieval for production unless you have evidence that vector-only is enough.
  6. Rerank before generation.
  7. Retrieve small chunks, then answer with their parent context.
  8. Add query rewriting only after you can see retrieval traces.
  9. Treat tables, images, diagrams, and code blocks as special ingestion cases.
  10. Require grounded answers with citations.
  11. Teach the system to say “not found” and “sources conflict.”
  12. Evaluate retrieval, context, generation, and product success separately.
  13. Maintain a living eval set from real failures.
  14. Monitor production drift as documents, user questions, and models change.
  15. Put humans in the loop for policy, legal, medical, financial, or high-impact decisions.
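
Items 2 through 4 are easier to enforce when they are written down as a schema. A minimal sketch of what a chunk's metadata record and a pre-retrieval filter could look like: the field names, the `"active"` status value, and `is_eligible` are assumptions for illustration, not any vector store's API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkMetadata:
    source_id: str                 # stable ID that citations can point to
    section_path: tuple[str, ...]  # document hierarchy, outermost section first
    tenant: str
    product: str
    region: str
    version: str
    status: str                    # assumed values: "active", "draft", "archived"
    owner: str
    effective_date: date


def is_eligible(meta: ChunkMetadata, tenant: str, region: str, as_of: date) -> bool:
    """Keep only active chunks for this tenant and region that are already in effect."""
    return (
        meta.tenant == tenant
        and meta.region == region
        and meta.status == "active"
        and meta.effective_date <= as_of
    )
```

In practice the same conditions would usually be pushed down into the retrieval query as metadata filters. Keeping them as a plain function as well gives you something easy to unit test.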

A Simple Decision Guide

Use simple RAG when:

  • questions are mostly single-hop
  • docs are clean and already structured
  • freshness needs are moderate
  • citations are useful but not legally critical

Use structured/hierarchical RAG when:

  • users ask about policies, specs, procedures, contracts, or manuals
  • chunks need parent context
  • sections, versions, owners, and exceptions matter
  • table and code boundaries matter

Use GraphRAG or relationship-aware retrieval when:

  • questions require multi-hop reasoning
  • answers aggregate across many documents
  • entity relationships matter
  • the corpus is narrative, messy, or cross-linked
  • users ask “why,” “what changed,” “what depends on this,” or “which themes appear across the corpus”

Use fine-tuning when:

  • the model has the right context but repeatedly uses the wrong style, format, or decision policy
  • you need consistent behavior more than new knowledge
  • you have a high-quality representative training set

Do not use RAG as a substitute for fixing the knowledge base. If the source material is contradictory, stale, or ownerless, retrieval will mostly make those problems visible faster.

Open Gaps and Follow-Up Sections

This article focuses on design and evaluation. Three areas deserve deeper treatment:

  • Security: prompt injection through retrieved documents, source permissions, tenant isolation, and audit logging.
  • Multimodal RAG: PDFs with tables, screenshots, diagrams, image-heavy manuals, and OCR confidence.
  • Cost engineering: embedding refresh strategy, caching, reranker cost, context budgets, and latency SLOs.

Those are not side quests. They are usually what decides whether a RAG app survives production traffic.

Source List