RAG Apps in Practice: Structured Knowledge and Evaluation

TL;DR

RAG is not “upload documents and ask questions.” Production RAG is a knowledge system: ingestion, chunking, metadata, retrieval, reranking, answer synthesis, citation, evaluation, monitoring, and repair.
The biggest design mistake is treating well-structured knowledge as flat text. Policies, tickets, specs, contracts, manuals, and architecture docs usually need hierarchy, metadata, stable IDs, parent-child chunks, and relationship-aware retrieval.
Benchmarks keep showing the same uncomfortable truth: naive RAG helps, but it is not enough. CRAG reports that strong LLMs alone reach at most 34% accuracy on its factual questions, straightforward RAG raises that to 44%, and even strong industry RAG systems answer only 63% of questions without hallucination.
Evaluate the chain, not only the final answer. Measure retrieval recall, context precision, grounding or faithfulness, answer relevance, refusal behavior, citation quality, latency, cost, and production drift.
The best current pattern is hybrid: structured ingestion + semantic and lexical retrieval + metadata filters + reranking + source-aware generation + automated evals + human review on high-risk cases.
For connected knowledge, consider hierarchical retrieval, contextual chunks, GraphRAG-style summaries, or knowledge graph retrieval. Use them when the question needs relationships, aggregation, multi-hop reasoning, or corpus-level synthesis.
A strong default stack is Postgres for metadata and eval data, Qdrant or pgvector for hybrid retrieval, OpenAI or Voyage embeddings, a dedicated reranker, LlamaIndex or thin custom orchestration, and RAGAS/LangSmith/LangWatch-style evals.
If you want a Python-first full-stack implementation, Agno can cover a lot of the app surface: knowledge bases, vector DB integrations, agentic RAG, structured outputs, tools, AgentOS APIs, sessions, traces, RBAC, UI integration, and built-in evals.

What You Will Learn Here

How to think about RAG as a chain of decisions, not a single vector search call.
What recent RAG benchmarks say about naive retrieval, hallucination, dynamic facts, and multi-hop questions.
How to design chunks for structured and chained knowledge.
Which evaluation metrics matter for engineers and PMs.
A practical reference architecture you can adapt for production apps.
A concrete support-policy assistant example with source IDs, retrieval flow, JSON output, and eval checks.
A recommended RAG technology stack for MVPs, serious product teams, and enterprise knowledge systems.
How to assemble the same kind of full-stack RAG app with Agno and its built-in features.
A small Python-style example for checking retrieval and grounded answers.

This article is for engineers and PMs building AI apps over product docs, internal knowledge bases, support histories, policies, specs, contracts, runbooks, customer records, or other business knowledge.

The friendly warning: RAG is one of the easiest AI patterns to demo and one of the easiest to mis-measure.

RAG Is a Chain, Not a Feature

The original RAG idea is simple: retrieve external knowledge, put it into the model context, and generate an answer grounded in that knowledge. That is still the core move.

But in practice, a RAG app is a chain:

User question
     |
     v
Intent + query understanding
     |
     v
Retriever selection + filters
     |
     v
Hybrid search over structured chunks
     |
     v
Reranking + deduplication + context packing
     |
     v
Answer generation with citations
     |
     v
Post-checks: grounding, policy, format, refusal
     |
     v
Telemetry + evals + dataset updates

Every step can fail.

If chunking breaks a table, retrieval may miss the answer. If retrieval returns the right page but the wrong section, the model may guess. If the answer is correct but citation IDs are unstable, nobody can audit it. If evals only score final answer style, the team may never notice that retrieval recall got worse after a corpus update.

So the design question is not:

Which vector database should we use?

It is:

How do we preserve the knowledge structure needed to answer the questions users actually ask?

What Benchmarks Are Telling Us

Benchmarks are not your product. They are still useful because they reveal failure modes that show up again in production.

1. Naive RAG Improves Answers, But Leaves a Big Gap

The CRAG benchmark was designed around factual QA with dynamic facts, long-tail entities, multiple domains, and simulated web and knowledge graph APIs. Its headline result is sobering: strong LLMs alone reach at most 34% accuracy, straightforward RAG improves to 44%, and strong industry RAG systems answer 63% of questions without hallucination.

The takeaway is not “RAG failed.” The takeaway is that simple top-k retrieval is only the first rung. Real questions often require freshness, entity disambiguation, temporal reasoning, multi-source synthesis, and careful rejection when evidence is missing.

2. RAG Needs More Than Answer Accuracy

RGB evaluates RAG systems on four abilities:

Ability	Practical meaning
Noise robustness	Can the model ignore irrelevant retrieved text?
Negative rejection	Can it say “I do not know” when evidence is absent?
Information integration	Can it combine multiple relevant pieces?
Counterfactual robustness	Can it avoid trusting retrieved misinformation?

This maps almost perfectly to production incidents. Many RAG apps do not fail because retrieval returned nothing. They fail because retrieval returned a confusing mix of true, stale, partial, and irrelevant context.

3. Evaluation Needs to Diagnose Components

RAGBench and RAGChecker both push evaluation toward explainability and component-level diagnosis. That matters because “bad answer” is not actionable enough.

A final answer can fail for different reasons:

The correct chunk was never indexed.
The chunk exists but was not retrieved.
The chunk was retrieved but ranked too low.
The retrieved context was correct but too noisy.
The model ignored the evidence.
The model answered correctly but cited the wrong source.
The source was stale or superseded.

Those are different bugs. They need different fixes.

The Design Rule: Preserve Meaningful Boundaries

Most internal knowledge is not a smooth stream of paragraphs. It has shape.

Examples:

A policy has sections, exceptions, effective dates, owners, and approval status.
A support article has symptoms, causes, fixes, affected versions, and related tickets.
An architecture decision record has context, decision, consequences, and linked systems.
A contract has clauses, definitions, dependencies, jurisdictions, and amendment history.
A product spec has goals, non-goals, requirements, flows, analytics, and open questions.

If you flatten all of that into 800-token chunks with overlap, you often destroy the exact structure the model needs.

Better chunks should behave like small knowledge cards:

Chunk ID: policy.refunds.2026.section_4.exception_2
Document: Refund Policy
Version: 2026-04-10
Section path: Refund Policy > Exceptions > Chargebacks
Audience: Support agents
Applies to: US, CA
Status: Approved
Parent: policy.refunds.2026.section_4
Related: policy.payments.chargebacks, macro.support.refund_denied
Text: ...

This kind of metadata lets the system retrieve by both meaning and structure.
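
To make that concrete, here is a minimal sketch of a card as a typed record with a structural filter. The `KnowledgeCard` class, its field names, and the `matches` helper are illustrative assumptions, not part of any library; a real system would push this filter down into the index.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KnowledgeCard:
    """Hypothetical retrieval unit mirroring the card fields shown above."""
    chunk_id: str
    document: str
    version: str
    section_path: str
    audience: str
    applies_to: frozenset
    status: str
    text: str
    parent_id: Optional[str] = None
    related_ids: tuple = ()


def matches(card: KnowledgeCard, *, region: str, audience: str) -> bool:
    """Structural filter: approved cards for this audience and region only."""
    return (
        card.status == "Approved"
        and card.audience == audience
        and region in card.applies_to
    )


card = KnowledgeCard(
    chunk_id="policy.refunds.2026.section_4.exception_2",
    document="Refund Policy",
    version="2026-04-10",
    section_path="Refund Policy > Exceptions > Chargebacks",
    audience="Support agents",
    applies_to=frozenset({"US", "CA"}),
    status="Approved",
    text="...",
    parent_id="policy.refunds.2026.section_4",
    related_ids=("policy.payments.chargebacks", "macro.support.refund_denied"),
)

print(matches(card, region="CA", audience="Support agents"))
print(matches(card, region="DE", audience="Support agents"))
```

Semantic search then only has to rank cards that already passed the structural check.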

A Practical Architecture for Structured RAG

Here is the architecture I trust for most production RAG apps:

                    +-----------------------+
                    | Source systems        |
                    | Docs, CMS, tickets, DB|
                    +-----------+-----------+
                                |
                                v
                    +-----------------------+
                    | Structured ingestion  |
                    | parse, clean, version |
                    +-----------+-----------+
                                |
              +-----------------+-----------------+
              v                                   v
   +-----------------------+           +-----------------------+
   | Retrieval units       |           | Source graph / links  |
   | chunks, parents, refs |           | entities, relations   |
   +-----------+-----------+           +-----------+-----------+
               |                                   |
               v                                   v
   +-----------------------+           +-----------------------+
   | Hybrid indexes        |           | Metadata filters      |
   | vector + BM25         |           | tenant, date, status  |
   +-----------+-----------+           +-----------+-----------+
               +-----------------+-----------------+
                                 v
                    +-----------------------+
                    | Rerank + context pack |
                    | diversify, cite, trim |
                    +-----------+-----------+
                                |
                                v
                    +-----------------------+
                    | Grounded generation   |
                    | answer + citations    |
                    +-----------+-----------+
                                |
                                v
                    +-----------------------+
                    | Evaluation + tracing  |
                    | offline + production  |
                    +-----------------------+

The important part is not the boxes. It is the separation of responsibilities. Retrieval units, source graph, metadata filters, and evaluation traces should be first-class concepts, not incidental strings inside a prompt.

Chunking: Start With the User’s Question, Not Token Size

Chunk size matters, but “what is the best chunk size?” is usually the wrong first question.

Ask these instead:

What questions should this chunk answer by itself?
What parent context does it need?
What metadata must be filterable?
What related chunks should be easy to follow?
What should never be split?

Here is a practical rule of thumb:

Knowledge type	Better chunk boundary
FAQ	One question-answer pair
Policy	Section or exception
API docs	Endpoint, parameter group, error code, example
ADR	Decision, consequence, related system
Legal/contract	Clause, definition, amendment
Support ticket	Symptom, diagnosis, resolution, environment
Tables	Row group with headers preserved
Code docs	Function/class plus signature and examples

For well-structured knowledge, I like parent-child retrieval:

Embed small child chunks for precision
        |
        v
Retrieve child chunks
        |
        v
Expand to parent section for context
        |
        v
Rerank parent + child evidence together

This gives you precision without starving the model of context.
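
A minimal sketch of the expansion step, assuming an in-memory store keyed by chunk ID. The `CHUNKS` data, the chunk IDs, and the `expand_to_parents` name are hypothetical; in practice the lookup would hit your metadata database.

```python
# Hypothetical store: chunk_id -> {"text": ..., "parent_id": ...}
CHUNKS = {
    "refunds.s4": {"text": "Section 4: Exceptions ...", "parent_id": None},
    "refunds.s4.e1": {"text": "Exception 1 ...", "parent_id": "refunds.s4"},
    "refunds.s4.e2": {"text": "Exception 2 ...", "parent_id": "refunds.s4"},
    "payments.s2": {"text": "Section 2: Disputes ...", "parent_id": None},
    "payments.s2.a": {"text": "Dispute windows ...", "parent_id": "payments.s2"},
}


def expand_to_parents(child_ids: list, chunks: dict) -> list:
    """Group retrieved child chunks under their parent, preserving hit order."""
    grouped = {}
    for child_id in child_ids:
        # Top-level chunks have no parent and stand in for themselves.
        parent_id = chunks[child_id]["parent_id"] or child_id
        grouped.setdefault(parent_id, []).append(child_id)
    return [
        {
            "parent_id": parent_id,
            "parent_text": chunks[parent_id]["text"],
            "matched_children": children,
        }
        for parent_id, children in grouped.items()
    ]


evidence = expand_to_parents(
    ["refunds.s4.e2", "payments.s2.a", "refunds.s4.e1"], CHUNKS
)
for item in evidence:
    print(item["parent_id"], item["matched_children"])
```

Grouping by parent also deduplicates: two child hits from the same section cost one parent section in the context, not two.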

Anthropic’s contextual retrieval pattern points in a similar direction: enrich each chunk with document-level context before embedding and lexical indexing. Anthropic reports that contextual embeddings combined with contextual BM25 cut the top-20 retrieval failure rate by 49%, and by 67% when reranking is added. The broader lesson is simple: a chunk should carry enough surrounding meaning to be retrieved correctly outside its original document.
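
Here is a rough sketch of the idea. Note the simplification: Anthropic's pattern generates the context with an LLM, while this version builds it from metadata with a template, which is a cheaper stand-in and not the same technique. The `contextualize` function and its parameters are assumptions for illustration.

```python
def contextualize(
    chunk_text: str, document_title: str, section_path: str, version: str
) -> str:
    """Prefix a chunk with document-level context before embedding and keyword indexing."""
    header = (
        f"Document: {document_title} (version {version})\n"
        f"Section: {section_path}"
    )
    return f"{header}\n\n{chunk_text}"


indexed_text = contextualize(
    chunk_text="Refunds are not issued while a chargeback is open.",
    document_title="Refund Policy",
    section_path="Refund Policy > Exceptions > Chargebacks",
    version="2026-04-10",
)
print(indexed_text)
```

Index the contextualized text, but keep the original chunk text for display and citation so users see the source wording.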

When Knowledge Is Chained, Retrieve the Chain

Some questions are not answered by one chunk.

Example:

Can an enterprise customer in Canada get a refund if they renewed through a reseller and opened a chargeback?

That may require:

regional refund policy
reseller terms
enterprise contract exceptions
chargeback policy
latest support macro
customer account metadata

Flat semantic search may retrieve one or two pieces and miss the chain.

A better strategy is staged retrieval:

Question
   |
   v
Extract entities and constraints
   |
   +--> country = Canada
   +--> customer_type = enterprise
   +--> channel = reseller
   +--> event = renewal
   +--> dispute = chargeback
   |
   v
Retrieve policy candidates with filters
   |
   v
Follow related links and parent sections
   |
   v
Rerank evidence set
   |
   v
Answer only if the chain supports it
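
The "follow related links" step can be sketched as a bounded breadth-first walk over `related_ids`. The `RELATED` map, the IDs in it, and the `follow_links` name are hypothetical; the hop limit is there so one well-connected chunk does not pull in the whole corpus.

```python
from collections import deque

# Hypothetical link table: chunk_id -> related chunk IDs
RELATED = {
    "policy.refunds.reseller": ["policy.payments.chargebacks"],
    "policy.payments.chargebacks": [
        "macro.support.refund_escalation",
        "policy.refunds.reseller",
    ],
    "macro.support.refund_escalation": [],
}


def follow_links(seed_ids: list, related: dict, max_hops: int = 1) -> list:
    """Expand seed chunks along related links, breadth-first, up to max_hops."""
    ordered = list(dict.fromkeys(seed_ids))  # dedupe while keeping order
    seen = set(ordered)
    queue = deque((chunk_id, 0) for chunk_id in ordered)
    while queue:
        chunk_id, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for neighbor in related.get(chunk_id, []):
            if neighbor not in seen:
                seen.add(neighbor)
                ordered.append(neighbor)
                queue.append((neighbor, hops + 1))
    return ordered


print(follow_links(["policy.refunds.reseller"], RELATED, max_hops=1))
print(follow_links(["policy.refunds.reseller"], RELATED, max_hops=2))
```

The expanded set still goes through reranking afterwards, so following a link only nominates evidence; it does not guarantee it reaches the model.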

GraphRAG-style approaches, RAPTOR-style hierarchical retrieval, and newer graph-based systems like LightRAG are all responses to the same limitation: flat chunks are weak at connected, aggregate, or multi-hop questions.

You do not need a knowledge graph for every RAG app. But you should consider one when users ask questions like:

“What changed between these versions?”
“Which teams are affected by this policy?”
“What is the relationship between these incidents?”
“Summarize the themes across all customer complaints.”
“Find exceptions that depend on region, plan, and contract type.”

Those are relationship questions, not just similarity questions.

Retrieval: Hybrid First, Fancy Later

For many production systems, the highest-leverage retrieval stack is:

metadata filters
   +
BM25 / keyword search
   +
vector search
   +
reranker
   +
context packing

Why hybrid?

Vector search is good at semantic similarity. Keyword search is good at exact terms, product names, IDs, error codes, clause numbers, and weird internal language. Business knowledge has a lot of weird internal language.
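
One common way to merge the two result lists is reciprocal rank fusion, which only needs ranks, not comparable scores. A minimal sketch follows; the constant `k = 60` is the value commonly used with this method, and the document IDs are placeholders.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(
    rankings: list, k: int = 60, limit: Optional[int] = None
) -> list:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for stable output.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused if limit is None else fused[:limit]


dense_hits = ["a", "b", "c"]    # placeholder vector-search ranking
keyword_hits = ["c", "a", "d"]  # placeholder BM25 ranking
print(reciprocal_rank_fusion([dense_hits, keyword_hits]))
```

Documents that appear in both lists rise to the top, which is usually what you want when an exact ID match and a semantic match agree.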

Reranking matters because top-k retrieval is usually noisy. A cross-encoder or LLM reranker can inspect the query and candidate chunks together, then reorder the list before the generator sees it.

Context packing matters because a model can lose the answer inside too much irrelevant context. More retrieved tokens can reduce quality if they drown out the evidence.
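
A minimal packing sketch under a token budget. The whitespace word count is a crude stand-in for a real tokenizer, and the `pack_context` name and chunk shape are assumptions; swap in your model's tokenizer for production use.

```python
def count_words(text: str) -> int:
    """Rough token estimate; replace with a real tokenizer."""
    return len(text.split())


def pack_context(ranked_chunks: list, token_budget: int, count_tokens=count_words) -> list:
    """Greedily keep the highest-ranked chunks that fit, skipping exact duplicates."""
    packed, used, seen_texts = [], 0, set()
    for chunk in ranked_chunks:
        normalized = " ".join(chunk["text"].lower().split())
        if normalized in seen_texts:
            continue  # same text already packed
        cost = count_tokens(chunk["text"])
        if used + cost > token_budget:
            continue  # too big; a smaller lower-ranked chunk may still fit
        packed.append(chunk)
        used += cost
        seen_texts.add(normalized)
    return packed


ranked = [
    {"id": "a", "text": "Chargeback refunds require finance review"},
    {"id": "b", "text": "chargeback refunds require  finance review"},
    {"id": "c", "text": "Reseller renewals are refunded by the reseller not by support"},
    {"id": "d", "text": "Escalate to finance"},
]
print([c["id"] for c in pack_context(ranked, token_budget=9)])
```

Log which chunks were dropped here. When a needed source was retrieved but never reached the model, this step is the usual suspect.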

Generation: Make the Model Show Its Evidence

The generation prompt should make the evidence contract explicit:

Use only the provided sources.
Answer the user's question directly.
Include citations for every factual claim.
If the sources conflict, name the conflict.
If the sources do not contain the answer, say what is missing.
Do not infer policy from similar but non-applicable sources.

For PMs, this is the product contract: users should know whether the system knows, does not know, or found conflicting evidence.

For engineers, this is the debugging contract: every answer should point back to retrievable source IDs.
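
The debugging contract can be checked mechanically. This sketch assumes citations appear in the answer as source IDs in square brackets, which is a formatting convention chosen here for illustration; the helper names and sample texts are hypothetical.

```python
import re

# Assumed citation format: [source.id] with letters, digits, dots, underscores, hyphens.
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_.\-]+)\]")


def format_sources(sources: list) -> str:
    """Render sources so each one carries its ID into the prompt."""
    return "\n\n".join(f"[{s['id']}]\n{s['text']}" for s in sources)


def check_citations(answer: str, sources: list) -> dict:
    """Compare IDs cited in the answer against the IDs that were provided."""
    provided = {s["id"] for s in sources}
    cited = set(CITATION_PATTERN.findall(answer))
    return {
        "has_citations": bool(cited),
        "unknown_ids": sorted(cited - provided),  # cited but never provided
        "unused_ids": sorted(provided - cited),   # provided but never cited
    }


sources = [
    {"id": "policy.payments.2026.chargebacks.section_4",
     "text": "Chargeback-related refunds require finance review."},
    {"id": "policy.refunds.2026.reseller_terms.section_2",
     "text": "Illustrative reseller terms text."},
]
answer = (
    "Escalate to finance before refunding "
    "[policy.payments.2026.chargebacks.section_4]. "
    "See also [policy.refunds.2026.made_up]."
)
report = check_citations(answer, sources)
print(report)
```

A non-empty `unknown_ids` list is a strong signal to block or regenerate the answer, since the model cited something it was never shown.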

Practical Example: A Support Policy Assistant

Let’s make this less abstract.

Imagine a SaaS company wants an internal assistant for support agents. The assistant answers policy questions before an agent replies to a customer.

The knowledge sources are:

refund policy pages
reseller terms
enterprise contract templates
support macros
incident notes
product plan metadata

The hardest question is not “What is the refund policy?” It is a chained policy question:

Customer ACME is on Enterprise, renewed through a reseller, is in Canada, and opened a chargeback. Can support issue a refund?

A practical RAG flow would look like this:

User question
   |
   v
Extract constraints
   - customer = ACME
   - plan = Enterprise
   - region = Canada
   - sales_channel = reseller
   - event = renewal
   - dispute = chargeback
   |
   v
Fetch account metadata from app DB
   |
   v
Retrieve policy chunks with metadata filters
   - status = approved
   - effective_date <= today
   - region in [Canada, Global]
   - audience = support
   |
   v
Hybrid search
   - vector: "reseller renewal refund chargeback enterprise"
   - keyword: reseller, chargeback, renewal, Enterprise
   |
   v
Rerank top 40 candidates down to top 8
   |
   v
Expand child chunks to parent sections
   |
   v
Generate answer with citations and a decision state

The assistant should not return a vague answer like:

It depends. Please review the reseller and chargeback policy.

A better answer is structured:

{
  "decision": "do_not_refund_without_escalation",
  "confidence": "medium",
  "answer": "Support should not issue the refund directly. The customer renewed through a reseller, and the chargeback policy requires finance escalation before refund approval.",
  "required_escalation": "finance",
  "citations": [
    "policy.refunds.2026.reseller_terms.section_2",
    "policy.payments.2026.chargebacks.section_4",
    "macro.support.refund_escalation.2026"
  ],
  "missing_information": [
    "The current reseller agreement ID was not available in retrieved sources."
  ]
}

Notice what this does for the product:

The support agent gets a direct recommendation.
The answer is grounded in stable source IDs.
The system exposes missing information instead of hiding uncertainty.
The decision can be audited later.
The response can be evaluated automatically.

The ingestion model behind this can stay simple:

chunk = {
    "id": "policy.payments.2026.chargebacks.section_4",
    "text": "Chargeback-related refunds require finance review before approval...",
    "parent_id": "policy.payments.2026.chargebacks",
    "section_path": "Payments > Chargebacks > Refund approval",
    "source_url": "https://internal.example.com/policies/payments#chargebacks",
    "status": "approved",
    "effective_date": "2026-04-01",
    "owner": "finance-ops",
    "audience": "support",
    "regions": ["global"],
    "entities": ["refund", "chargeback", "finance review"],
    "related_ids": [
        "policy.refunds.2026.reseller_terms.section_2",
        "macro.support.refund_escalation.2026"
    ],
}

And your retrieval function should return evidence, not just text:

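from datetime import date


def retrieve_policy_evidence(question: str, account: dict) -> list[dict]:
    # vector_search, keyword_search, reciprocal_rank_fusion, rerank, and
    # expand_with_parent_sections are placeholders for your own retrieval layer.
    filters = {
        "status": "approved",
        "audience": "support",
        "regions": ["global", account["country"]],
        # Only policies already in effect: effective_date <= today.
        "effective_date_lte": date.today().isoformat(),
    }

    dense_hits = vector_search(question, filters=filters, top_k=30)
    keyword_hits = keyword_search(question, filters=filters, top_k=30)
    # Fuse both rankings, then let the reranker cut 40 candidates down to 8.
    fused_hits = reciprocal_rank_fusion([dense_hits, keyword_hits], limit=40)
    reranked = rerank(question, fused_hits, top_k=8)

    return expand_with_parent_sections(reranked)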

The eval for this one example should check:

Check	Expected
Retrieval recall	Reseller terms and chargeback policy are in top 8
Citation coverage	Final answer cites both required policy sections
Decision correctness	`decision = do_not_refund_without_escalation`
Missing information	Mentions reseller agreement ID if unavailable
Grounding	No claim appears without source support
Product behavior	Escalates to finance instead of inventing an approval
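
Most rows in that table can be checked with plain code. Here is a minimal sketch; the `check_policy_response` name and the substring match for missing information are assumptions, and the grounding row is left out because it needs a claim extractor, an LLM judge, or human labels.

```python
REQUIRED_SOURCES = {
    "policy.refunds.2026.reseller_terms.section_2",
    "policy.payments.2026.chargebacks.section_4",
}


def check_policy_response(response: dict, retrieved_ids: list, k: int = 8) -> dict:
    """Run the deterministic checks for this single eval example."""
    missing = response.get("missing_information", [])
    return {
        "retrieval_recall": REQUIRED_SOURCES <= set(retrieved_ids[:k]),
        "citation_coverage": REQUIRED_SOURCES <= set(response.get("citations", [])),
        "decision_correct": response.get("decision") == "do_not_refund_without_escalation",
        "escalates_to_finance": response.get("required_escalation") == "finance",
        # Loose substring match; a stricter check would use a dedicated field.
        "mentions_missing_agreement": any("reseller agreement" in m.lower() for m in missing),
    }


response = {
    "decision": "do_not_refund_without_escalation",
    "confidence": "medium",
    "required_escalation": "finance",
    "citations": [
        "policy.refunds.2026.reseller_terms.section_2",
        "policy.payments.2026.chargebacks.section_4",
        "macro.support.refund_escalation.2026",
    ],
    "missing_information": [
        "The current reseller agreement ID was not available in retrieved sources."
    ],
}
retrieved_ids = [
    "policy.payments.2026.chargebacks.section_4",
    "macro.support.refund_escalation.2026",
    "policy.refunds.2026.reseller_terms.section_2",
]
results = check_policy_response(response, retrieved_ids)
print(results)
```

Keeping each check as its own boolean tells you which link in the chain broke, instead of a single pass/fail.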

That is the practical loop: design the source IDs, retrieve the evidence chain, force the answer into an auditable shape, and evaluate each step.

A Recommended Technology Stack

There is no universal best RAG stack, but there are sensible defaults. I would choose the stack based on team size, data shape, and how much operational control you need.

Default Stack for Most Product Teams

Layer	Recommendation	Why
App/API	FastAPI, Node.js, or your existing backend	Keep RAG close to product auth, tenancy, and logs.
Ingestion	LlamaIndex or custom pipeline with unstructured parsers	Good abstractions for parsing, metadata, nodes, and retrieval patterns.
Storage	Postgres for source metadata and eval datasets	Your product already trusts it; great for IDs, versions, permissions, and audit trails.
Retrieval index	Qdrant, pgvector, Weaviate, Pinecone, or OpenSearch	Pick based on scale and team ops preferences; require metadata filters.
Hybrid search	Dense vectors + BM25/sparse vectors	Better for product names, IDs, clause numbers, and natural-language questions.
Embeddings	OpenAI `text-embedding-3-small` for cost, `text-embedding-3-large` for quality-sensitive search	Official docs list 1536 dimensions for small and 3072 for large, with dimension reduction available.
Reranking	Cohere Rerank, Voyage rerankers, bge-reranker, or provider-native rerank	Rerankers usually improve precision before the LLM sees context.
Generation	GPT-4.1/GPT-5 class model, Claude Sonnet class model, or your approved enterprise model	Choose by eval results, not brand preference.
Evaluation	RAGAS or DeepEval locally; LangSmith, LangWatch, TruLens, or Arize Phoenix for managed tracing	You need retrieval and generation metrics, plus trace debugging.
Observability	OpenTelemetry + product analytics + eval dashboards	Connect model quality to user outcomes, latency, and cost.

My default implementation choice for a serious but not enormous product would be:

Backend:        FastAPI or existing Node.js service
Metadata DB:    Postgres
Vector search:  Qdrant with dense + sparse hybrid search
Embeddings:     text-embedding-3-large for quality, small for cost-sensitive paths
Reranker:       Cohere Rerank or Voyage reranker
Orchestration:  LlamaIndex for ingestion/retrieval, thin custom app logic
Evaluation:     RAGAS + LangSmith/LangWatch traces
Monitoring:     OpenTelemetry + dashboards for latency, cost, retrieval recall, faithfulness

Why this stack?

Qdrant has first-class hybrid queries with dense, sparse, and multivector retrieval plus score fusion.
pgvector is excellent when you want fewer moving parts and already run Postgres; combine it with Postgres full-text search and reciprocal rank fusion for hybrid search.
LlamaIndex is useful for structured ingestion, metadata-aware retrieval, and production RAG patterns.
LangSmith, LangWatch, TruLens, RAGAS, and DeepEval all help separate retrieval quality from answer quality.
Cohere’s rerank API and similar rerankers are designed to reorder candidate documents by relevance before generation.

Lightweight Stack for an MVP

Use this when you need to learn fast:

Postgres + pgvector
Postgres full-text search
OpenAI embeddings
One reranker
RAGAS or DeepEval test set
Simple trace table in Postgres

This is boring in the best way. It keeps the system understandable. The main tradeoff is that you will build more retrieval glue yourself.

Heavier Stack for Enterprise Knowledge

Use this when knowledge is large, permissioned, cross-linked, and audited:

Document pipeline: LlamaParse, Unstructured, or custom parsers
Metadata system: Postgres
Search: OpenSearch/Elasticsearch + vector DB, or Qdrant/Weaviate with hybrid search
Graph layer: Neo4j, Kuzu, or graph tables in Postgres
Reranking: dedicated reranker
LLM gateway: LiteLLM or internal provider gateway
Eval/observability: LangSmith, LangWatch, Arize Phoenix, TruLens, OpenTelemetry
Human review: annotation queue for failed and high-risk answers

This stack is heavier because enterprise RAG is rarely just search. It is permissions, freshness, ownership, auditability, and incident response wearing a search costume.

Full-Stack RAG with Agno Built-In Features

If your team is Python-first and wants to ship a RAG product without assembling every runtime piece by hand, Agno is a strong candidate.

As of June 15, 2026, the current Agno docs describe three useful layers:

Layer	Agno feature	What it gives you
Knowledge	`Knowledge`, vector DB integrations, content DB	Documents, metadata, embeddings, and searchable knowledge.
Agent runtime	`Agent`, tools, structured output, teams, workflows	The reasoning and action layer that answers or delegates.
Product runtime	AgentOS, Agent API, AG-UI, AgentUI, tracing, evals, RBAC	The API, UI, sessions, traces, auth, and operating surface.

Agno’s important design choice is that knowledge can be attached directly to an agent. With knowledge attached, Agno uses Agentic RAG by default: the agent can decide when to search the knowledge base. If you want classic RAG, where relevant references are always added to context based on the user message, Agno exposes that mode too.

Here is how I would implement the support-policy assistant from the previous section with Agno.

1. Store Knowledge in Postgres + PgVector

For a serious product, start with Postgres because it can hold source metadata, sessions, traces, eval datasets, and vectors through PgVector.

from agno.db.postgres.postgres import PostgresDb
from agno.knowledge.embedder.openai import OpenAIEmbedder
from agno.knowledge.knowledge import Knowledge
from agno.vectordb.pgvector import PgVector, SearchType

db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"

db = PostgresDb(db_url=db_url)

policy_knowledge = Knowledge(
    name="support_policy_knowledge",
    description="Approved support policies, reseller terms, chargeback rules, and support macros.",
    contents_db=db,
    vector_db=PgVector(
        table_name="support_policy_vectors",
        db_url=db_url,
        embedder=OpenAIEmbedder(),
        search_type=SearchType.hybrid,
    ),
)

The important pieces:

contents_db gives AgentOS a place to manage uploaded content and metadata.
PgVector stores embeddings and supports production-friendly Postgres operations.
SearchType.hybrid keeps exact policy terms, IDs, and semantic meaning in the same retrieval path.
You can swap PgVector for Qdrant, Weaviate, Pinecone, LanceDB, Chroma, Redis, or other supported vector stores without rewriting the agent logic.

2. Insert Documents with Metadata

Agno can ingest content into the knowledge base, but you still need to treat metadata as product design.

await policy_knowledge.ainsert(
    name="Refund Policy 2026",
    url="https://internal.example.com/policies/refunds-2026.pdf",
    metadata={
        "status": "approved",
        "audience": "support",
        "owner": "finance-ops",
        "effective_date": "2026-04-01",
        "regions": ["global", "CA", "US"],
        "source_type": "policy",
    },
)

Do not stop at “upload a PDF.” Add the metadata your retrieval policy needs: status, owner, region, product, plan, effective date, source type, and permission scope.

3. Add Product Tools Around the Knowledge Base

Knowledge answers policy questions. Tools answer product-state questions.

from agno.tools import tool


@tool
def get_customer_account(customer_name: str) -> dict:
    """Fetch account facts needed for policy decisions."""
    # Stubbed response for illustration; replace with a lookup against your app DB or CRM.
    return {
        "customer": customer_name,
        "plan": "Enterprise",
        "country": "CA",
        "sales_channel": "reseller",
        "renewal_date": "2026-05-12",
        "chargeback_open": True,
    }

This is the correct split:

Knowledge base -> "What do the policies say?"
Product tools  -> "What is true about this customer right now?"
Agent          -> "Given both, what decision is supported?"

4. Force a Structured Decision Output

For product workflows, do not return loose prose. Use structured output so the frontend, audit log, and evals can rely on stable fields.

from typing import Literal

from pydantic import BaseModel, Field


class PolicyDecision(BaseModel):
    # Literal types make the allowed values part of the schema, not just a description.
    decision: Literal["refund_allowed", "deny", "escalate"]
    confidence: Literal["low", "medium", "high"]
    answer: str
    required_escalation: str | None = None
    citations: list[str]
    missing_information: list[str] = Field(default_factory=list)

Agno supports Pydantic-backed structured outputs, so this becomes an API contract rather than a parsing exercise.

5. Build the RAG Agent

from agno.agent import Agent
from agno.models.openai import OpenAIResponses

policy_agent = Agent(
    name="support_policy_agent",
    model=OpenAIResponses(id="gpt-5.2"),
    knowledge=policy_knowledge,
    tools=[get_customer_account],
    output_schema=PolicyDecision,
    instructions=[
        "Answer only from approved policy knowledge and account tool data.",
        "Search the knowledge base before making policy claims.",
        "Cite source IDs for every policy claim.",
        "If required sources are missing or conflict, escalate instead of guessing.",
        "Never approve a refund when chargeback rules require finance review.",
    ],
)

This already gives you the core app behavior:

knowledge search
tool use
grounded answer generation
typed output
traceable source usage

6. Serve It with AgentOS

AgentOS turns the agent into a production API. The current Agno docs describe REST and SSE as default API surfaces, with options for AG-UI, WebSocket, MCP, A2A, Slack, and other interfaces.

from agno.os import AgentOS
from agno.os.interfaces.agui import AGUI

agent_os = AgentOS(
    name="Support Policy OS",
    agents=[policy_agent],
    interfaces=[AGUI(agent=policy_agent)],
    tracing=True,
)

app = agent_os.get_app()

if __name__ == "__main__":
    agent_os.serve(app="support_policy_os:app", reload=True)

That gives you:

FastAPI runtime
REST endpoints
SSE-compatible streaming
session APIs
run APIs
trace capture
AgentOS UI connectivity
AG-UI-compatible frontend integration

In other words, your frontend can be an Astro/React/Next.js app, but the RAG runtime does not need to be hand-rolled.

7. Use Built-In Evals Before Release

Agno has built-in eval dimensions for accuracy, agent-as-judge criteria, performance, and reliability. For this RAG assistant, I would run at least two eval types:

Accuracy eval:
  Input: "ACME renewed through reseller in Canada and has an open chargeback. Can support refund?"
  Expected: decision = escalate, required_escalation = finance

Reliability eval:
  Expected tool calls:
    - get_customer_account
    - search_knowledge_base

The release gate should not be “the answer looked good once.” It should be:

Golden set passes
   +
required tool calls happen
   +
citations are present
   +
structured output validates
   +
trace shows the right knowledge was retrieved

8. Use AgentOS for the Operating Loop

Once deployed, use Agno’s database-backed runtime features as your improvement loop:

Production run
   |
   v
AgentOS stores session + trace + tool calls
   |
   v
Review bad or low-confidence answers
   |
   v
Promote failures into eval examples
   |
   v
Improve metadata, policies, tools, prompt, or model
   |
   v
Rerun evals before release

This is where Agno’s “built-in” features matter most. AgentOS is not only a way to expose an agent as an API. It gives you the runtime records needed to debug, evaluate, and continuously improve the RAG app.

When Agno Is the Right Fit

Use Agno for RAG when:

your backend team is comfortable with Python
you want Agentic RAG without writing the search tool contract from scratch
you want sessions, traces, memory, knowledge, evals, and APIs in one runtime
you need to expose the same agent through REST, streaming, AG-UI, or other interfaces
you want data ownership, with sessions, knowledge, and traces in your database

Be more cautious when:

your company already has a mature TypeScript-only AI runtime
you need very custom retrieval orchestration that fights the framework
your main challenge is search infrastructure, not agent runtime

My practical recommendation: use Agno to own the agent/runtime layer, but keep the knowledge model explicit. Stable IDs, metadata, retrieval tests, and source ownership are still your responsibility.

Evaluation: Measure the Whole RAG Chain

Good RAG evaluation has at least four layers.

Layer	Question	Example metric
Retrieval	Did we retrieve the right evidence?	recall@k, context precision, MRR
Context	Did we pass useful, non-noisy context?	context relevance, redundancy, freshness
Generation	Did the answer use evidence correctly?	faithfulness, groundedness, answer relevance
Product	Did this satisfy the user safely?	task success, escalation rate, CSAT, human review

RAGAS, TruLens, DeepEval, LangSmith, and similar tools all circle around the same core idea: evaluate retrieval and generation separately, then evaluate the end-to-end experience.

Here is a small evaluator sketch:

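from dataclasses import dataclass


@dataclass
class RagExample:
    """One labeled eval case: the question plus what a good run must find and say."""
    question: str
    required_source_ids: set[str]
    expected_answer_facts: set[str]


@dataclass
class RagRun:
    """What the system actually did for one question."""
    retrieved_source_ids: list[str]  # ranked, best match first
    answer: str
    cited_source_ids: set[str]


def recall_at_k(example: RagExample, run: RagRun, k: int = 5) -> float:
    """Fraction of required sources that appear in the top-k retrieved results."""
    retrieved = set(run.retrieved_source_ids[:k])
    if not example.required_source_ids:
        return 1.0  # nothing was required, so nothing was missed
    return len(example.required_source_ids & retrieved) / len(example.required_source_ids)


def citation_coverage(example: RagExample, run: RagRun) -> float:
    """Fraction of required sources that the answer actually cites."""
    if not example.required_source_ids:
        return 1.0
    return len(example.required_source_ids & run.cited_source_ids) / len(example.required_source_ids)


def unsupported_fact_count(answer_claims: set[str], supported_claims: set[str]) -> int:
    """Number of claims in the answer that no retrieved source supports."""
    return len(answer_claims - supported_claims)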

This is intentionally simple. In a real system, answer_claims and supported_claims might come from an LLM judge, a claim extractor, or human labels. The point is that you should not only ask, “Was the answer nice?” Ask whether the required sources were retrieved, cited, and used.
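
The sketch above covers recall and citations but ignores rank. The metrics table also lists MRR and context precision, so here is a minimal, hedged companion. The function names are my own, and precision_at_k is a plain stand-in for context precision, not the rank-weighted definition that some eval libraries use.

```python
def reciprocal_rank(relevant_ids: set[str], retrieved_ids: list[str]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, source_id in enumerate(retrieved_ids, start=1):
        if source_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[set[str], list[str]]]) -> float:
    """Average reciprocal rank over (relevant_ids, retrieved_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(rel, ret) for rel, ret in runs) / len(runs)


def precision_at_k(relevant_ids: set[str], retrieved_ids: list[str], k: int = 5) -> float:
    """Share of the top-k retrieved results that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for source_id in top_k if source_id in relevant_ids) / len(top_k)
```

Read together, low recall with high precision suggests the retriever is too narrow, while high recall with low precision suggests noisy context that a reranker or tighter filters should clean up.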

A useful RAG eval set should include more than happy-path questions.

Case type	Why it matters
Known answer	Basic correctness
Missing answer	Refusal and uncertainty
Conflicting sources	Conflict handling
Stale source	Freshness and versioning
Multi-hop question	Chained evidence
Entity ambiguity	Disambiguation
Exact identifier	Keyword and metadata retrieval
Table-heavy answer	Parser and chunk quality
Policy exception	Boundary logic
User typo or vague query	Query rewriting

A practical starting set is 50 to 100 examples. One split that adds up to 90:

30 common user questions
20 edge cases from production logs or support tickets
20 adversarial or stale-source cases
10 missing-answer cases
10 multi-hop or relationship questions

Then add failures from production every week.
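
To keep that mix honest as the set grows, it helps to tag every case with a type and check coverage automatically. The following is a small sketch, not a prescribed schema: EvalCase, the case_type labels, and TARGET_COUNTS are illustrative names, with the targets copied from the split above.

```python
from collections import Counter
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    case_type: str  # one of the keys in TARGET_COUNTS (hypothetical labels)
    question: str


# Targets mirror the starting split listed above.
TARGET_COUNTS: Mapping[str, int] = {
    "common": 30,
    "edge": 20,
    "adversarial_or_stale": 20,
    "missing_answer": 10,
    "multi_hop": 10,
}


def coverage_gaps(
    cases: list[EvalCase], targets: Mapping[str, int] = TARGET_COUNTS
) -> dict[str, int]:
    """Return how many more cases each under-filled type needs."""
    counts = Counter(case.case_type for case in cases)
    return {
        case_type: target - counts[case_type]
        for case_type, target in targets.items()
        if counts[case_type] < target
    }
```

Running coverage_gaps in CI makes it obvious when weekly additions pile up in one category and leave missing-answer or multi-hop cases thin.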

How to Diagnose Failures

When a RAG answer fails, do not start by changing the prompt. Use the trace.

Bad answer
   |
   v
Was the needed source indexed?
   | no -> ingestion/parser/versioning bug
   v yes
Was it retrieved in top-k?
   | no -> query, embeddings, filters, chunking, hybrid search
   v yes
Was it ranked high enough?
   | no -> reranker, scoring, metadata boost
   v yes
Was it included in final context?
   | no -> context packing, dedupe, token budget
   v yes
Did the model use it correctly?
   | no -> prompt, model, citation rules, post-check
   v yes
Is the expected label wrong?
       -> update dataset
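
The same walk can be automated over stored traces. This is a sketch under assumptions: FailureTrace and its fields are hypothetical and would need to map onto whatever your tracing tool records, and answer_used_source is assumed to come from a judge or a human label.

```python
from dataclasses import dataclass


@dataclass
class FailureTrace:
    needed_source_id: str
    indexed_ids: set[str]      # everything currently in the index
    retrieved_ids: list[str]   # first-stage candidates, in rank order
    reranked_ids: list[str]    # order after reranking
    context_ids: list[str]     # sources actually packed into the prompt
    answer_used_source: bool   # from an LLM judge or human review


def diagnose(trace: FailureTrace, top_k: int = 20, rank_cutoff: int = 5) -> str:
    """Return the first stage of the chain where the needed source was lost."""
    source_id = trace.needed_source_id
    if source_id not in trace.indexed_ids:
        return "ingestion"        # parser, ingestion, or versioning bug
    if source_id not in trace.retrieved_ids[:top_k]:
        return "retrieval"        # query, embeddings, filters, chunking, hybrid search
    if source_id not in trace.reranked_ids[:rank_cutoff]:
        return "ranking"          # reranker, scoring, metadata boost
    if source_id not in trace.context_ids:
        return "context_packing"  # dedupe or token budget dropped it
    if not trace.answer_used_source:
        return "generation"       # prompt, model, citation rules, post-check
    return "dataset_label"        # the chain worked, so recheck the expected label
```

Counting the returned stage across a week of failed runs shows where to spend effort before anyone touches the prompt.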

This is where PMs can help a lot. Many failures are not model failures. They are knowledge ownership failures: stale docs, unclear policy, missing source of truth, or contradictory pages.

Best Practices I Would Use in 2026

Here is the short version I would put into a team checklist.

Design the corpus before tuning the model.
Keep source IDs stable and citation-friendly.
Preserve document hierarchy and section paths.
Add metadata for tenant, product, region, version, status, owner, and effective date.
Use hybrid retrieval for production unless you have evidence that vector-only is enough.
Rerank before generation.
Retrieve small chunks, then answer with their parent context.
Add query rewriting only after you can see retrieval traces.
Treat tables, images, diagrams, and code blocks as special ingestion cases.
Require grounded answers with citations.
Teach the system to say “not found” and “sources conflict.”
Evaluate retrieval, context, generation, and product success separately.
Maintain a living eval set from real failures.
Monitor production drift as documents, user questions, and models change.
Put humans in the loop for policy, legal, medical, financial, or high-impact decisions.
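
Several of these items come down to what you store next to each chunk. As a rough sketch, the metadata from the checklist could look like the record below, with a pre-retrieval filter on top. ChunkMeta, the status values, and the eligible and filter_chunks helpers are illustrative assumptions, not the API of any particular vector store.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkMeta:
    source_id: str       # stable, citation-friendly ID
    section_path: str    # preserved hierarchy, e.g. "Refunds > Exceptions"
    tenant: str
    product: str
    region: str
    version: str
    status: str          # assumed values: "active", "draft", "retired"
    owner: str
    effective_date: date


def eligible(chunk: ChunkMeta, *, tenant: str, region: str, as_of: date) -> bool:
    """True if the chunk may be shown to this tenant and region on the given date."""
    return (
        chunk.tenant == tenant
        and chunk.region == region
        and chunk.status == "active"
        and chunk.effective_date <= as_of
    )


def filter_chunks(
    chunks: list[ChunkMeta], *, tenant: str, region: str, as_of: date
) -> list[ChunkMeta]:
    """Apply the metadata filter before any semantic or lexical scoring."""
    return [c for c in chunks if eligible(c, tenant=tenant, region=region, as_of=as_of)]
```

In practice the same conditions would usually be pushed down into the database or vector store query, but keeping them in one explicit function makes them easy to test.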

A Simple Decision Guide

Use simple RAG when:

questions are mostly single-hop
docs are clean and already structured
freshness needs are moderate
citations are useful but not legally critical

Use structured/hierarchical RAG when:

users ask about policies, specs, procedures, contracts, or manuals
chunks need parent context
sections, versions, owners, and exceptions matter
table and code boundaries matter

Use GraphRAG or relationship-aware retrieval when:

questions require multi-hop reasoning
answers aggregate across many documents
entity relationships matter
the corpus is narrative, messy, or cross-linked
users ask “why,” “what changed,” “what depends on this,” or “which themes appear across the corpus”

Use fine-tuning when:

the model has the right context but repeatedly uses the wrong style, format, or decision policy
you need consistent behavior more than new knowledge
you have a high-quality representative training set
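
For teams that want this guide in a design review template, the rules above can be written down as a tiny function. This is only a sketch of the heuristics, with hypothetical field and label names, and it is no replacement for looking at real queries.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    multi_hop_or_aggregation: bool = False   # relationships, cross-document synthesis
    structured_docs: bool = False            # policies, specs, contracts, manuals
    wrong_behavior_despite_good_context: bool = False
    has_training_set: bool = False


def suggest_approaches(needs: Needs) -> list[str]:
    """Map the decision guide to a list of approaches worth evaluating."""
    suggestions: list[str] = []
    if needs.multi_hop_or_aggregation:
        suggestions.append("graph_or_relationship_aware_rag")
    if needs.structured_docs:
        suggestions.append("structured_hierarchical_rag")
    if not suggestions:
        suggestions.append("simple_rag")  # single-hop questions over clean docs
    # Fine-tuning addresses behavior, not missing knowledge, and needs data.
    if needs.wrong_behavior_despite_good_context and needs.has_training_set:
        suggestions.append("fine_tuning")
    return suggestions
```

The approaches are not exclusive: a policy corpus with cross-references can reasonably score for both structured and relationship-aware retrieval.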

Do not use RAG as a substitute for fixing the knowledge base. If the source material is contradictory, stale, or ownerless, retrieval will mostly make those problems visible faster.

Open Gaps and Follow-Up Sections

This article focuses on design and evaluation. Three areas deserve deeper treatment:

Security: prompt injection through retrieved documents, source permissions, tenant isolation, and audit logging.
Multimodal RAG: PDFs with tables, screenshots, diagrams, image-heavy manuals, and OCR confidence.
Cost engineering: embedding refresh strategy, caching, reranker cost, context budgets, and latency SLOs.

Those are not side quests. They are usually what decides whether a RAG app survives production traffic.

Source List

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Lewis et al., 2020.
CRAG: Comprehensive RAG Benchmark, Meta and collaborators, 2024.
RGB: Benchmarking Large Language Models in Retrieval-Augmented Generation, Chen et al., 2023.
RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems, Friel et al., 2024.
RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation, Ru et al., 2024.
OpenAI: Optimizing LLM Accuracy, especially the guidance on when RAG helps and why retrieval must be evaluated.
OpenAI: Evaluation Best Practices, especially dataset design and scoring against specific criteria.
OpenAI: Vector Embeddings, including embedding dimensions and use cases.
Anthropic: Contextual Retrieval, 2024.
LangChain: Evaluate a RAG Application.
LangChain: Deconstructing RAG.
LangSmith evaluation overview, especially retrieval and generation quality separation.
LlamaIndex: Document Chunking Strategies.
LlamaIndex: Building Performant RAG Applications for Production.
Qdrant: Hybrid Queries.
pgvector: Hybrid Search.
Cohere: Rerank API.
Agno: Welcome, SDK and AgentOS overview.
Agno: Knowledge Overview, including Agentic RAG and traditional RAG modes.
Agno: Agents with Knowledge, including search_knowledge and add_knowledge_to_context.
Agno: Vector Databases, including PgVector hybrid search and Qdrant support.
Agno: Content DB, for AgentOS Knowledge UI support.
Agno: AgentOS Introduction, for production APIs, sessions, knowledge, traces, RBAC, and governance.
Agno: Agent API, for REST, SSE, sessions, memory, and run APIs.
Agno: Structured Output for Agents, for Pydantic-backed response contracts.
Agno: Evals Overview, for accuracy, reliability, performance, and agent-as-judge evals.
Agno: AgentUI, for a self-hostable UI over AgentOS data.
Microsoft GraphRAG documentation.
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval, Sarthi et al., 2024.
LightRAG: Simple and Fast Retrieval-Augmented Generation, Guo et al., 2024.
Ragas metrics documentation.
TruLens RAG Triad.

Luis Mori Guerra
