AI Quality & Evaluation

Implementing LangWatch Evaluations with Agno and AgentOS: From Zero to Production

A practical guide for implementing LangWatch evaluations in Agno and AgentOS systems, from first traces and batch experiments to structured-output scoring, production monitors, and background evaluation hooks.

17 min read · Updated Mar 31, 2026

You can build a capable Agno agent in an afternoon. The harder part is proving it stays good after prompt edits, model swaps, tool changes, retrieval changes, and production traffic. That is where LangWatch fits well.

As of March 31, 2026, the current Agno docs show LangWatch integration through langwatch.setup(instrumentors=[AgnoInstrumentor()]), while the current LangWatch docs position evaluations as three connected layers: offline experiments, online monitors, and real-time guardrails. Put differently: Agno/AgentOS executes the system; LangWatch measures the system.

TL;DR

  • Use Agno to build the agent, team, or workflow and AgentOS to serve it in production.
  • Use LangWatch tracing first, because every useful evaluation workflow depends on having clean trace data.
  • Start with offline experiments on a small golden dataset before you touch production monitors.
  • For Agno agents with output_schema, score business fields, not just free-form text quality.
  • Keep guardrails synchronous and move non-blocking evaluations into AgentOS background post-hooks.
  • Use Agno native evals for framework-local accuracy, reliability, and performance checks; use LangWatch for shared datasets, experiments, monitors, alerts, and cross-run visibility.

What You Will Learn Here

  • How LangWatch and Agno/AgentOS divide responsibilities in a production agent stack
  • How to instrument an Agno agent so LangWatch receives traces automatically
  • How to run offline experiments on Agno agents with built-in evaluators and custom scorers
  • How to evaluate structured outputs produced by output_schema
  • How to add online evaluation and guardrails without turning your runtime into a latency trap
  • How to use AgentOS hooks to run background evaluations safely
  • How to close the loop from production traces back into reusable datasets

The Mental Model

Think of the stack like this:

User / API Client
        ↓
Agno Agent / Team / Workflow
        ↓
AgentOS Runtime
  - sessions
  - auth
  - APIs
  - hooks
  - deployment surface
        ↓
LangWatch
  - traces and spans
  - offline experiments
  - online monitors
  - guardrails
  - alerts and analytics

This split matters because it prevents a common mistake: trying to make the framework be the evaluation platform, or the evaluation platform be the runtime.

A practical division of labor

Concern | Best home
Agent behavior, tools, sessions, memory | Agno
Production API, auth, hooks, runtime control | AgentOS
Traces, datasets, experiments, monitors, alerts | LangWatch
Framework-local reliability and performance evals | Agno native evals
Cross-run product quality and live traffic scoring | LangWatch

If you remember just one rule, make it this one:

Agno decides what the agent does.
LangWatch decides how well it did.
AgentOS decides how it runs in production.

A Real Use Case We Can Actually Evaluate

We will use a support routing agent because it is realistic and easy to reason about:

  • It reads a user message plus retrieved policy context
  • It decides the user intent
  • It decides whether a human escalation is needed
  • It generates a customer-facing reply
  • It returns structured output we can score deterministically

That is a good fit for both Agno and LangWatch because it mixes:

  • free text quality
  • retrieval quality
  • business-rule correctness
  • runtime observability

Step 1: Instrument Agno with LangWatch

The current Agno LangWatch integration uses OpenInference instrumentation, so setup is small:

uv pip install langwatch agno openai openinference-instrumentation-agno
export LANGWATCH_API_KEY="your-api-key"

import langwatch

from agno.agent import Agent
from agno.models.openai import OpenAIResponses
from openinference.instrumentation.agno import AgnoInstrumentor

langwatch.setup(instrumentors=[AgnoInstrumentor()])

agent = Agent(
    name="Support Router",
    model=OpenAIResponses(id="gpt-5.2"),
    instructions=[
        "Classify the user intent.",
        "Decide whether the case needs human escalation.",
        "Reply using only approved policy context.",
    ],
    debug_mode=True,
)

agent.print_response("I was billed twice. Can you refund one of the charges?")

That is enough to get Agno activity into LangWatch. From there, you can inspect traces before you even add formal evals.

Why tracing comes first

Without traces, your evaluation layer has weak context:

  • you do not know which prompt version produced the result
  • you do not know which retrieved chunks were used
  • you do not know whether a failure came from reasoning, retrieval, or a tool path

With traces, every later evaluation becomes more useful.

Step 2: Make the Agent Return Typed Output

Agno’s output_schema is the right starting point for evaluation because it gives you something scoreable.

If you are on an older Agno release, you may still see response_model in examples. The current docs use output_schema, which is what I use throughout this article.

from typing import Literal

from agno.agent import Agent
from agno.models.openai import OpenAIResponses
from pydantic import BaseModel, Field


class SupportDecision(BaseModel):
    intent: Literal["refund", "bug", "shipping", "pricing", "other"]
    needs_human: bool = Field(description="True when a human agent should take over")
    customer_reply: str
    cited_policy_ids: list[str]


support_agent = Agent(
    name="Support Router",
    model=OpenAIResponses(id="gpt-5.2"),
    output_schema=SupportDecision,
    instructions=[
        "Classify the support intent.",
        "Set needs_human=true for billing disputes, legal issues, or unclear policy cases.",
        "Use only the provided policy context.",
        "Always cite the policy ids you relied on.",
    ],
)


def route_ticket(user_message: str, policy_context: str) -> SupportDecision:
    response = support_agent.run(
        f"User message: {user_message}\n\nPolicy context:\n{policy_context}"
    )
    return response.content

This already improves reliability, but it does not mean your agent is correct. It only means the output is shaped correctly.

That distinction is important:

Schema-valid != business-correct
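That gap is easy to demonstrate. The snippet below reuses the SupportDecision model from above (redeclared so it is self-contained) and builds an output that pydantic accepts without complaint but that violates two of the agent's own rules, the escalation rule for billing disputes and the citation requirement:

```python
from typing import Literal

from pydantic import BaseModel, Field


class SupportDecision(BaseModel):
    intent: Literal["refund", "bug", "shipping", "pricing", "other"]
    needs_human: bool = Field(description="True when a human agent should take over")
    customer_reply: str
    cited_policy_ids: list[str]


# Parses cleanly: output_schema would accept this without complaint.
decision = SupportDecision(
    intent="refund",
    needs_human=False,       # violates "billing disputes need a human"
    customer_reply="Refund issued.",
    cited_policy_ids=[],     # violates "always cite the policy ids"
)

# Deterministic business checks catch what the schema cannot.
violations = []
if decision.intent == "refund" and not decision.needs_human:
    violations.append("billing disputes must be escalated to a human")
if not decision.cited_policy_ids:
    violations.append("reply must cite at least one policy id")

print(violations)  # two violations despite a schema-valid output
```

This is exactly the kind of check that belongs in a custom scorer rather than in the schema itself.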

Step 3: Run Your First Offline Experiment

Now we turn the agent into something measurable.

LangWatch experiments are the right layer for:

  • prompt comparisons
  • model comparisons
  • retrieval comparisons
  • regression detection before deployment

A simple golden dataset

You can start from CSV, but LangWatch datasets are more useful long-term because you can also populate them from traces.

import json
import langwatch


df = langwatch.dataset.get_dataset("support-routing-golden-set").to_pandas()
experiment = langwatch.experiment.init("support-routing-v1")


def score_decision(output: SupportDecision, expected_json: str) -> float:
    expected = json.loads(expected_json)

    checks = [
        output.intent == expected["intent"],
        output.needs_human == expected["needs_human"],
        set(output.cited_policy_ids) == set(expected["cited_policy_ids"]),
    ]
    return sum(checks) / len(checks)


for index, row in experiment.loop(df.iterrows()):
    with experiment.target(
        "gpt5-baseline",
        {"model": "openai/gpt-5.2", "prompt_version": "support-router-v1"},
    ):
        decision = route_ticket(
            user_message=row["user_message"],
            policy_context=row["policy_context"],
        )

        experiment.log_response(decision.model_dump_json())

        experiment.log(
            "routing_accuracy",
            index=index,
            score=score_decision(decision, row["expected_json"]),
            data={
                "intent": decision.intent,
                "needs_human": decision.needs_human,
                "cited_policy_ids": decision.cited_policy_ids,
            },
        )

        experiment.evaluate(
            "ragas/faithfulness",
            index=index,
            data={
                "input": row["user_message"],
                "output": decision.customer_reply,
                "contexts": row["retrieved_chunks"],
            },
            settings={"model": "openai/gpt-5.2"},
        )


experiment.print_summary()

What this experiment is doing

  • experiment.target(...) makes the run comparable against future models or prompts
  • experiment.log_response(...) stores the actual output for inspection
  • experiment.log(...) records a custom business metric
  • experiment.evaluate(...) runs a built-in evaluator for response grounding

This inline pattern is easier to teach and easier to trust. If you later want a more advanced concurrency model with explicit task submission, document it separately and show the corresponding wait or join flow.

This is the first point where LangWatch becomes more than observability. It becomes a release gate.

Step 4: Score the Right Thing for Structured Outputs

Many agent teams make the same mistake here: they only score the reply text.

For structured Agno agents, that is usually the wrong level.

You normally want to score at least four layers:

Layer | Example question | Best scoring style
Schema validity | Did the output parse into the expected shape? | Agno output_schema + optional format checks
Decision correctness | Did the agent choose the right intent and escalation path? | Custom deterministic scorer
Grounding quality | Did the reply stay within retrieved policy context? | Built-in evaluator such as faithfulness
User-facing quality | Was the final reply clear and useful? | LLM-as-judge or human review

That produces a much better evaluation picture than a single “quality score.”

A better scoring flow

Agno output_schema
  - catches malformed structure
        ↓
Custom scorer
  - intent correct?
  - escalation correct?
  - cited policy ids correct?
        ↓
Built-in evaluator
  - faithful to retrieved context?
  - clear enough for a customer?
The important idea is that business correctness and language quality are separate dimensions.
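To make the separation concrete, here is a small sketch that reports layers 1 and 2 as independent scores. This is plain pydantic and Python, not a LangWatch API; layers 3 and 4 would come from built-in evaluators or an LLM-as-judge, as shown earlier:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class SupportDecision(BaseModel):
    # Redeclared here so the sketch is self-contained.
    intent: Literal["refund", "bug", "shipping", "pricing", "other"]
    needs_human: bool
    customer_reply: str
    cited_policy_ids: list[str]


def layered_scores(raw: dict, expected: dict) -> dict[str, float]:
    """Score schema validity and decision correctness as separate dimensions."""
    # Layer 1: schema validity (0 or 1).
    try:
        decision = SupportDecision.model_validate(raw)
    except ValidationError:
        return {"schema_valid": 0.0, "decision_correct": 0.0}

    # Layer 2: decision correctness, averaged over deterministic checks.
    checks = [
        decision.intent == expected["intent"],
        decision.needs_human == expected["needs_human"],
        set(decision.cited_policy_ids) == set(expected["cited_policy_ids"]),
    ]
    return {
        "schema_valid": 1.0,
        "decision_correct": sum(checks) / len(checks),
        # Layers 3 and 4 (grounding, reply quality) come from evaluators,
        # not from this function.
    }
```

Reporting the layers separately means a malformed output and a wrong-but-well-formed output show up as different failure modes in your dashboards.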

Step 5: Compare Models, Prompts, or Retrieval Strategies

Once your baseline experiment exists, the next win is comparison.

For example:

  • gpt5-baseline vs claude-sonnet
  • policy retriever v1 vs hybrid retriever v2
  • prompt with strict escalation rules vs prompt with softer escalation rules

In LangWatch, this is exactly what target(...) metadata is for.

with experiment.target(
    "hybrid-retriever-v2",
    {
        "model": "openai/gpt-5.2",
        "retriever": "hybrid",
        "prompt_version": "support-router-v2",
    },
):
    decision = route_ticket(
        user_message=row["user_message"],
        policy_context=row["hybrid_policy_context"],
    )

    experiment.log_response(decision.model_dump_json())
    experiment.log(
        "routing_accuracy",
        index=index,
        score=score_decision(decision, row["expected_json"]),
    )

This is the difference between “we think v2 feels better” and “v2 improved routing accuracy by 11 points while keeping faithfulness flat.”

That is a better engineering conversation, and it is also a better PM conversation.

Step 6: Add Online Evaluation for Production Traffic

Offline evals tell you if a release candidate is likely safe. They do not tell you what live traffic is doing this afternoon.

LangWatch online evaluation exists for that gap. The current docs describe this as Monitors that score incoming traces and can trigger dashboards or alerts when thresholds are breached.

The flow is simple:

Live request
        ↓
Agno / AgentOS handles request
        ↓
LangWatch trace arrives
        ↓
Monitor runs evaluator
   ├─ score stored
   ├─ dashboard updated
   └─ alert fired if threshold is crossed

Good production monitor candidates

  • faithfulness on grounded customer replies
  • PII leakage on agent outputs
  • jailbreak detection on agent inputs
  • custom routing confidence or escalation correctness

The key pattern is:

  • offline experiments protect releases
  • online monitors protect production

You need both.

Step 7: Add Real-Time Guardrails for Risky Paths

Guardrails are different from monitors because they can change control flow immediately.

For example, if you are exposing an AgentOS API endpoint directly to users, you might want to reject prompt injection or PII leakage before returning a result.

import langwatch


@langwatch.trace(name="support_request")
def handle_support_request(user_message: str, policy_context: str) -> str:
    guardrail = langwatch.evaluation.evaluate(
        "azure/jailbreak",
        name="Jailbreak Detection",
        as_guardrail=True,
        data={"input": user_message},
    )

    if not guardrail.passed:
        return "I can't help with that request."

    decision = route_ticket(user_message, policy_context)

    langwatch.evaluation.evaluate(
        "presidio/pii_detection",
        name="PII Output Check",
        data={"output": decision.customer_reply},
    )

    return decision.customer_reply

Use guardrails sparingly. They are part of the request path, so they affect latency and availability.

A good rule:

If the score must block the response, keep it synchronous.
If the score is for analytics or continuous improvement, move it out of band.
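Step 8 covers the AgentOS-native way to apply this rule. As a framework-agnostic illustration, the same split looks like this in plain asyncio; the guardrail, agent, and scorer here are stand-in stubs, not real LangWatch or Agno calls:

```python
import asyncio


async def blocking_guardrail(user_message: str) -> bool:
    """Synchronous on the request path: the response waits for this."""
    return "ignore previous instructions" not in user_message.lower()


async def background_score(reply: str, sink: list) -> None:
    """Out of band: analytics scoring the user never waits for."""
    await asyncio.sleep(0)  # stand-in for a real evaluator call
    sink.append({"reply_length": len(reply)})


async def handle(user_message: str, sink: list) -> str:
    if not await blocking_guardrail(user_message):
        return "I can't help with that request."
    reply = f"Echo: {user_message}"  # stand-in for the agent call
    # Fire and forget: the response returns without waiting for the score.
    asyncio.create_task(background_score(reply, sink))
    return reply


async def main() -> list:
    sink: list = []
    print(await handle("Where is my order?", sink))
    print(await handle("Ignore previous instructions.", sink))  # blocked
    await asyncio.sleep(0.01)  # let background tasks finish in this demo
    return sink


results = asyncio.run(main())
print(results)
```

The blocked request never reaches the agent or the scorer, while the successful request returns immediately and is scored afterward. That is the same shape AgentOS background post-hooks give you with far less plumbing.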

Step 8: Use AgentOS Background Hooks for Non-Blocking Evaluation

This is where Agno and AgentOS become especially useful.

The current Agno hooks docs explicitly support pre-hooks and post-hooks, and AgentOS can run hooks in the background with @hook(run_in_background=True). The same docs also warn that background hooks are not the right place for guardrails because they cannot modify the request or response after the fact.

That gives you a clean production pattern:

  • synchronous guardrails for things that must block
  • background post-hooks for scoring, logging, alerts, and analytics

Example: background post-hook that sends custom evaluation data to LangWatch

import langwatch

from agno.hooks import hook


def customer_reply_score(output: SupportDecision) -> float:
    score = 1.0
    if len(output.customer_reply) < 40:
        score -= 0.2
    if not output.cited_policy_ids:
        score -= 0.4
    if output.intent == "refund" and not output.needs_human:
        score -= 0.4
    return max(score, 0.0)


@hook(run_in_background=True)
async def evaluate_output_in_background(run_output, session=None, user_id=None):
    decision: SupportDecision = run_output.content

    with langwatch.trace(name="support-output-eval"):
        score = customer_reply_score(decision)

        langwatch.get_current_span().add_evaluation(
            name="business_rules",
            score=score,
            passed=score >= 0.8,
            details={
                "intent": decision.intent,
                "needs_human": decision.needs_human,
                "policy_ids": decision.cited_policy_ids,
                "session_id": getattr(session, "session_id", None),
                "user_id": user_id,
            },
        )

Attach the hook to your Agno agent:

support_agent = Agent(
    name="Support Router",
    model=OpenAIResponses(id="gpt-5.2"),
    output_schema=SupportDecision,
    post_hooks=[evaluate_output_in_background],
)

And if you are serving through AgentOS:

from agno.os import AgentOS

agent_os = AgentOS(
    name="Support AgentOS",
    agents=[support_agent],
    tracing=True,
)

app = agent_os.get_app()

Why this pattern works

  • the user gets the response immediately
  • the evaluation still runs on real production outputs
  • LangWatch still receives structured evaluation data
  • you can alert on low scores without slowing down the core request

For most agent products, this is the production sweet spot.

Step 9: Feed Production Failures Back into Datasets

The current LangWatch dataset docs support continuously populating datasets from traces, which is one of the most useful features in the whole system.

That creates a quality loop:

Production traffic
        ↓
LangWatch traces
        ↓
Monitors find low-scoring runs
        ↓
Add those traces into a dataset
        ↓
Re-run experiments before the next release

This is how an eval stack stops being a dashboard and becomes a learning system.

In practice, this is what I recommend:

  1. Start with a hand-curated golden set of 30 to 50 rows.
  2. Add a second dataset for real production misses.
  3. Promote recurring failures from production into the golden set once they are understood.

That keeps your eval suite grounded in real traffic instead of synthetic optimism.
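The promotion step in that recommendation is simple enough to sketch in plain Python. The record shape, threshold, and failure_tag field are illustrative assumptions, not a LangWatch schema; a real pipeline would read these records from your monitors:

```python
def promote_failures(traces: list[dict], threshold: float = 0.8,
                     min_occurrences: int = 2) -> list[dict]:
    """Collect low-scoring production traces and promote recurring ones.

    `traces` are illustrative records with `input`, `score`, and
    `failure_tag` keys (assumed shape, not a platform API).
    """
    misses = [t for t in traces if t["score"] < threshold]

    # Count how often each failure pattern recurs.
    counts: dict[str, int] = {}
    for t in misses:
        counts[t["failure_tag"]] = counts.get(t["failure_tag"], 0) + 1

    # Promote only understood, recurring failures into the golden set.
    return [t for t in misses if counts[t["failure_tag"]] >= min_occurrences]
```

One-off misses stay in the production-misses dataset for investigation; only failures you have seen more than once, and therefore understand, graduate into the golden set.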

Step 10: Know When to Use Agno Native Evals Too

Agno now has its own eval system for accuracy, agent-as-judge, performance, and reliability. That is not a reason to skip LangWatch. It is a reason to be deliberate.

Here is the practical split:

Use case | Better first tool
Check whether an Agno agent uses the right tool or handles rate limits correctly | Agno reliability evals
Measure latency and memory for framework-level regressions | Agno performance evals
Score business quality across a shared dataset and compare targets in a UI | LangWatch experiments
Monitor live traffic and alert on low scores | LangWatch monitors
Run request-path guardrails | LangWatch evaluators and Agno hooks together

My preferred production setup is:

  • use Agno native evals for framework-local behavior and performance
  • use LangWatch for release gating, shared datasets, production monitoring, and cross-run quality analytics

That avoids duplicating responsibilities while still taking advantage of both tools.

An Advanced Pattern: End-to-End Agent Behavior Testing

Offline row-based experiments are not enough for every agent. If your Agno system is multi-turn, tool-heavy, or workflow-based, you eventually want scenario-style testing too.

LangWatch’s Scenario testing layer is built for that higher level:

Level 1: unit and integration tests
Level 2: offline evals on components and outputs
Level 3: scenario tests for full multi-turn behavior

A useful pattern is:

  • use LangWatch experiments for row-level output quality
  • use Scenario for full conversation behavior
  • use AgentOS hooks and monitors for production drift

That gives you coverage across the whole stack instead of trying to force one test shape to do everything.
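To show the shape of a level-3 test without depending on the Scenario API, here is a pure-Python sketch that drives a stubbed multi-turn agent and asserts on end-of-conversation behavior. A real version would call your deployed Agno agent instead of stub_agent:

```python
def run_conversation(agent, turns: list[str]) -> dict:
    """Drive a multi-turn conversation and collect the final state.

    `agent` is any callable (message, history) -> reply; a stub stands in
    for a real Agno agent to keep the sketch self-contained.
    """
    history: list[tuple[str, str]] = []
    for message in turns:
        reply = agent(message, history)
        history.append((message, reply))
    return {"turns": len(history), "final_reply": history[-1][1]}


def stub_agent(message: str, history: list) -> str:
    # A real scenario test would exercise the deployed agent; this stub
    # escalates once the user mentions a billing dispute.
    if "billed twice" in message.lower():
        return "I'm escalating this to a human agent."
    return "Can you tell me more about the issue?"


result = run_conversation(stub_agent, [
    "My app keeps crashing.",
    "Also, I was billed twice this month.",
])
assert result["turns"] == 2
assert "escalating" in result["final_reply"]
```

The point is the test shape: the assertion targets conversation-level behavior (did the agent escalate by the end?), not any single row-level output, which is exactly what row-based experiments cannot cover.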

A Rollout Plan from Zero to Advanced

If you are starting from scratch, I would roll this out in this order:

  1. Instrument Agno with LangWatch and inspect a few traces manually.
  2. Add output_schema to the agent so the output is typed.
  3. Build one small golden dataset in LangWatch.
  4. Add one custom deterministic scorer for business correctness.
  5. Add one built-in evaluator for grounding, safety, or clarity.
  6. Compare two targets: prompt versions, models, or retrievers.
  7. Add one synchronous guardrail for the highest-risk path.
  8. Move non-critical scoring into AgentOS background post-hooks.
  9. Continuously add low-scoring traces back into a dataset.
  10. Add scenario tests if the system is multi-turn or workflow-heavy.

That path gives you useful signal quickly without building a giant evaluation platform all at once.

Common Mistakes

  • Treating tracing and evaluation as the same thing. Tracing is the substrate; evaluation is the judgment layer.
  • Scoring only the final reply text while ignoring structured business fields.
  • Using guardrails for everything, then wondering why latency spiked.
  • Pushing background hooks into control-flow decisions they cannot safely enforce.
  • Building a synthetic dataset and never refreshing it with production misses.
  • Comparing prompts or models without attaching metadata to targets.

Final Take

If you are using Agno and AgentOS, LangWatch is not just “another dashboard.” It is the part of the stack that lets you move from anecdotal quality to measurable quality.

The zero-to-advanced path is straightforward:

  • instrument first
  • type the outputs
  • run batch experiments
  • add online monitors
  • keep blocking logic synchronous
  • push non-blocking evaluation into AgentOS background hooks
  • continuously recycle production failures into datasets

That combination is strong because each layer stays focused on what it is best at:

Agno builds the agent.
AgentOS runs the agent.
LangWatch tells you whether the agent is actually getting better.

Other Alternatives to LangWatch

LangWatch is a strong fit for teams that want one platform for tracing, datasets, experiments, monitors, and guardrails. It is not the only good option.

If you are evaluating platforms for an Agno or AgentOS stack, these are the alternatives I would look at first:

Platform | Best fit | Why you might choose it
LangSmith | LangChain or LangGraph-heavy stacks | Strong observability and eval workflows, plus a very natural fit if your agents already live in the LangChain ecosystem
Langfuse | Teams that want an open-source LLM engineering platform with prompt management and evals | Good choice if you want tracing, prompts, datasets, experiments, and production scores in an OSS-friendly platform
Braintrust | Eval-first teams that care deeply about experiment comparison and running evals on their own infrastructure | Especially strong when evaluation is your main workflow and you want remote evals for agentic or compliance-sensitive systems
Arize Phoenix | Teams that want deep trace inspection plus code-driven or LLM-based evals | A strong option when you want to work close to traces and annotate or score spans directly
Helicone | Gateway and observability-first teams that want to centralize scores from other eval systems | Useful when you want logging, routing, and experimentation analytics, but do not necessarily want your observability tool to be your primary evaluator runtime

A simple selection heuristic

If your main problem is "evaluate and monitor my production agent stack":
  LangWatch is a strong default

If your main problem is "my stack is already LangChain / LangGraph":
  LangSmith is the easiest natural fit

If your main problem is "I want open source plus a broad platform":
  Langfuse or Phoenix are both worth serious consideration

If your main problem is "I want evals to be the center of the workflow":
  Braintrust is especially compelling

If your main problem is "I want gateway + observability and I can bring my own evaluators":
  Helicone is a practical alternative

The important part is not picking the “best” brand. It is matching the platform to the role it will play in your system:

  • runtime-adjacent evaluation
  • experiment workflow
  • trace debugging
  • gateway analytics
  • prompt and dataset management

For Agno and AgentOS users specifically, LangWatch stays attractive because it maps cleanly onto the same production concerns this article focused on: traces, offline experiments, online monitors, and guardrails. But if your organization is already standardized on one of the alternatives above, you can absolutely build a strong evaluation practice there too.
