You can build a capable Agno agent in an afternoon. The harder part is proving it stays good after prompt edits, model swaps, tool changes, retrieval changes, and production traffic. That is where LangWatch fits well.
As of March 31, 2026, the current Agno docs show LangWatch integration through langwatch.setup(instrumentors=[AgnoInstrumentor()]), while the current LangWatch docs position evaluations as three connected layers: offline experiments, online monitors, and real-time guardrails. Put differently: Agno/AgentOS executes the system; LangWatch measures the system.
TL;DR
- Use Agno to build the agent, team, or workflow and AgentOS to serve it in production.
- Use LangWatch tracing first, because every useful evaluation workflow depends on having clean trace data.
- Start with offline experiments on a small golden dataset before you touch production monitors.
- For Agno agents with `output_schema`, score business fields, not just free-form text quality.
- Keep guardrails synchronous and move non-blocking evaluations into AgentOS background post-hooks.
- Use Agno native evals for framework-local accuracy, reliability, and performance checks; use LangWatch for shared datasets, experiments, monitors, alerts, and cross-run visibility.
What You Will Learn Here
- How LangWatch and Agno/AgentOS divide responsibilities in a production agent stack
- How to instrument an Agno agent so LangWatch receives traces automatically
- How to run offline experiments on Agno agents with built-in evaluators and custom scorers
- How to evaluate structured outputs produced by `output_schema`
- How to add online evaluation and guardrails without turning your runtime into a latency trap
- How to use AgentOS hooks to run background evaluations safely
- How to close the loop from production traces back into reusable datasets
The Mental Model
Think of the stack like this:
```
User / API Client
        │
        ▼
Agno Agent / Team / Workflow
        │
        ▼
AgentOS Runtime
  - sessions
  - auth
  - APIs
  - hooks
  - deployment surface
        │
        ▼
LangWatch
  - traces and spans
  - offline experiments
  - online monitors
  - guardrails
  - alerts and analytics
```
This split matters because it prevents a common mistake: trying to make the framework be the evaluation platform, or the evaluation platform be the runtime.
A practical division of labor
| Concern | Best home |
|---|---|
| Agent behavior, tools, sessions, memory | Agno |
| Production API, auth, hooks, runtime control | AgentOS |
| Traces, datasets, experiments, monitors, alerts | LangWatch |
| Framework-local reliability and performance evals | Agno native evals |
| Cross-run product quality and live traffic scoring | LangWatch |
If you remember just one rule, make it this one:
Agno decides what the agent does.
LangWatch decides how well it did.
AgentOS decides how it runs in production.
A Real Use Case We Can Actually Evaluate
We will use a support routing agent because it is realistic and easy to reason about:
- It reads a user message plus retrieved policy context
- It decides the user intent
- It decides whether a human escalation is needed
- It generates a customer-facing reply
- It returns structured output we can score deterministically
That is a good fit for both Agno and LangWatch because it mixes:
- free text quality
- retrieval quality
- business-rule correctness
- runtime observability
Step 1: Instrument Agno with LangWatch
The current Agno LangWatch integration uses OpenInference instrumentation, so setup is small:
```bash
uv pip install langwatch agno openai openinference-instrumentation-agno
export LANGWATCH_API_KEY="your-api-key"
```
```python
import langwatch
from agno.agent import Agent
from agno.models.openai import OpenAIResponses
from openinference.instrumentation.agno import AgnoInstrumentor

langwatch.setup(instrumentors=[AgnoInstrumentor()])

agent = Agent(
    name="Support Router",
    model=OpenAIResponses(id="gpt-5.2"),
    instructions=[
        "Classify the user intent.",
        "Decide whether the case needs human escalation.",
        "Reply using only approved policy context.",
    ],
    debug_mode=True,
)

agent.print_response("I was billed twice. Can you refund one of the charges?")
```
That is enough to get Agno activity into LangWatch. From there, you can inspect traces before you even add formal evals.
Why tracing comes first
Without traces, your evaluation layer has weak context:
- you do not know which prompt version produced the result
- you do not know which retrieved chunks were used
- you do not know whether a failure came from reasoning, retrieval, or a tool path
With traces, every later evaluation becomes more useful.
Step 2: Make the Agent Return Typed Output
Agno’s `output_schema` is the right starting point for evaluation because it gives you something scoreable.
If you are on an older Agno release, you may still see `response_model` in examples. The current docs use `output_schema`, which is what I use throughout this article.
```python
from typing import Literal

from agno.agent import Agent
from agno.models.openai import OpenAIResponses
from pydantic import BaseModel, Field


class SupportDecision(BaseModel):
    intent: Literal["refund", "bug", "shipping", "pricing", "other"]
    needs_human: bool = Field(description="True when a human agent should take over")
    customer_reply: str
    cited_policy_ids: list[str]


support_agent = Agent(
    name="Support Router",
    model=OpenAIResponses(id="gpt-5.2"),
    output_schema=SupportDecision,
    instructions=[
        "Classify the support intent.",
        "Set needs_human=true for billing disputes, legal issues, or unclear policy cases.",
        "Use only the provided policy context.",
        "Always cite the policy ids you relied on.",
    ],
)


def route_ticket(user_message: str, policy_context: str) -> SupportDecision:
    response = support_agent.run(
        f"User message: {user_message}\n\nPolicy context:\n{policy_context}"
    )
    return response.content
```
This already improves reliability, but it does not mean your agent is correct. It only means the output is shaped correctly.
That distinction is important:
Schema-valid != business-correct
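To make that distinction concrete, here is a minimal sketch using a trimmed-down version of the `SupportDecision` model from above. The output parses cleanly, so `output_schema` is satisfied, yet it still violates the escalation rule from the agent's instructions:

```python
from typing import Literal

from pydantic import BaseModel


class SupportDecision(BaseModel):
    intent: Literal["refund", "bug", "shipping", "pricing", "other"]
    needs_human: bool
    customer_reply: str


# Parses without error: the shape is exactly what the schema demands.
decision = SupportDecision(
    intent="refund",
    needs_human=False,  # but refunds are billing disputes and must escalate
    customer_reply="Done! I've refunded the duplicate charge.",
)

# A deterministic business-rule check still fails, even though validation passed.
business_correct = not (decision.intent == "refund" and not decision.needs_human)
print(business_correct)  # False: schema-valid, but business-wrong
```

This is exactly the gap that the custom scorers in the next step exist to close.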
Step 3: Run Your First Offline Experiment
Now we turn the agent into something measurable.
LangWatch experiments are the right layer for:
- prompt comparisons
- model comparisons
- retrieval comparisons
- regression detection before deployment
A simple golden dataset
You can start from CSV, but LangWatch datasets are more useful long-term because you can also populate them from traces.
```python
import json

import langwatch

df = langwatch.dataset.get_dataset("support-routing-golden-set").to_pandas()
experiment = langwatch.experiment.init("support-routing-v1")


def score_decision(output: SupportDecision, expected_json: str) -> float:
    expected = json.loads(expected_json)
    checks = [
        output.intent == expected["intent"],
        output.needs_human == expected["needs_human"],
        set(output.cited_policy_ids) == set(expected["cited_policy_ids"]),
    ]
    return sum(checks) / len(checks)


for index, row in experiment.loop(df.iterrows()):
    with experiment.target(
        "gpt5-baseline",
        {"model": "openai/gpt-5.2", "prompt_version": "support-router-v1"},
    ):
        decision = route_ticket(
            user_message=row["user_message"],
            policy_context=row["policy_context"],
        )

        experiment.log_response(decision.model_dump_json())

        experiment.log(
            "routing_accuracy",
            index=index,
            score=score_decision(decision, row["expected_json"]),
            data={
                "intent": decision.intent,
                "needs_human": decision.needs_human,
                "cited_policy_ids": decision.cited_policy_ids,
            },
        )

        experiment.evaluate(
            "ragas/faithfulness",
            index=index,
            data={
                "input": row["user_message"],
                "output": decision.customer_reply,
                "contexts": row["retrieved_chunks"],
            },
            settings={"model": "openai/gpt-5.2"},
        )

experiment.print_summary()
```
What this experiment is doing
- `experiment.target(...)` makes the run comparable against future models or prompts
- `experiment.log_response(...)` stores the actual output for inspection
- `experiment.log(...)` records a custom business metric
- `experiment.evaluate(...)` runs a built-in evaluator for response grounding
This inline pattern is easy to teach and easy to trust. If you later want a more advanced concurrency model with explicit task submission, document that separately and show the corresponding wait or join flow.
This is the first point where LangWatch becomes more than observability. It becomes a release gate.
Step 4: Score the Right Thing for Structured Outputs
Many agent teams make the same mistake here: they only score the reply text.
For structured Agno agents, that is usually the wrong level.
You normally want to score at least four layers:
| Layer | Example question | Best scoring style |
|---|---|---|
| Schema validity | Did the output parse into the expected shape? | Agno output_schema + optional format checks |
| Decision correctness | Did the agent choose the right intent and escalation path? | Custom deterministic scorer |
| Grounding quality | Did the reply stay within retrieved policy context? | Built-in evaluator such as faithfulness |
| User-facing quality | Was the final reply clear and useful? | LLM-as-judge or human review |
That produces a much better evaluation picture than a single “quality score.”
A better scoring flow
```
Agno output_schema
        │
        ├─ catches malformed structure
        │
        ▼
Custom scorer
  - intent correct?
  - escalation correct?
  - cited policy ids correct?
        │
        ▼
Built-in evaluator
  - faithful to retrieved context?
  - clear enough for a customer?
```
The important idea is that business correctness and language quality are separate dimensions.
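One way to keep those dimensions separate is to return per-layer scores instead of one blended number, so a regression in decision correctness is never masked by an improvement in reply quality. Here is a hedged, framework-free sketch of that idea; the field names and layer names are illustrative, not a LangWatch schema:

```python
import json


def layered_scores(raw_output: str, expected: dict) -> dict:
    """Score each evaluation layer separately instead of blending them.

    Returns a dict of layer name -> score so that a regression in one
    dimension is never hidden by an improvement in another.
    """
    scores = {"schema_valid": 0.0, "decision_correct": 0.0}

    # Layer 1: schema validity (did the output even parse into the right shape?)
    try:
        output = json.loads(raw_output)
    except json.JSONDecodeError:
        return scores
    required = {"intent", "needs_human", "cited_policy_ids"}
    if not required <= output.keys():
        return scores  # malformed: no point scoring deeper layers
    scores["schema_valid"] = 1.0

    # Layer 2: deterministic business correctness
    checks = [
        output["intent"] == expected["intent"],
        output["needs_human"] == expected["needs_human"],
        set(output["cited_policy_ids"]) == set(expected["cited_policy_ids"]),
    ]
    scores["decision_correct"] = sum(checks) / len(checks)

    # Layers 3 and 4 (grounding, user-facing quality) would come from
    # LLM-based evaluators and are logged as separate metrics.
    return scores


result = layered_scores(
    '{"intent": "refund", "needs_human": true, "cited_policy_ids": ["P-12"]}',
    {"intent": "refund", "needs_human": True, "cited_policy_ids": ["P-12"]},
)
print(result)  # {'schema_valid': 1.0, 'decision_correct': 1.0}
```

Logged as separate metrics, these layers map directly onto the table above.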
Step 5: Compare Models, Prompts, or Retrieval Strategies
Once your baseline experiment exists, the next win is comparison.
For example:
- `gpt5-baseline` vs `claude-sonnet`
- policy retriever v1 vs hybrid retriever v2
- prompt with strict escalation rules vs prompt with softer escalation rules
In LangWatch, this is exactly what `target(...)` metadata is for.
```python
with experiment.target(
    "hybrid-retriever-v2",
    {
        "model": "openai/gpt-5.2",
        "retriever": "hybrid",
        "prompt_version": "support-router-v2",
    },
):
    decision = route_ticket(
        user_message=row["user_message"],
        policy_context=row["hybrid_policy_context"],
    )

    experiment.log_response(decision.model_dump_json())
    experiment.log(
        "routing_accuracy",
        index=index,
        score=score_decision(decision, row["expected_json"]),
    )
```
This is the difference between “we think v2 feels better” and “v2 improved routing accuracy by 11 points while keeping faithfulness flat.”
That is a better engineering conversation, and it is also a better PM conversation.
Step 6: Add Online Evaluation for Production Traffic
Offline evals tell you if a release candidate is likely safe. They do not tell you what live traffic is doing this afternoon.
LangWatch online evaluation exists for that gap. The current docs describe this as Monitors that score incoming traces and can trigger dashboards or alerts when thresholds are breached.
The flow is simple:
```
Live request
        │
        ▼
Agno / AgentOS handles request
        │
        ▼
LangWatch trace arrives
        │
        ▼
Monitor runs evaluator
        │
        ├─ score stored
        ├─ dashboard updated
        └─ alert fired if threshold is crossed
```
Good production monitor candidates
- faithfulness on grounded customer replies
- PII leakage on agent outputs
- jailbreak detection on agent inputs
- custom routing confidence or escalation correctness
The key pattern is:
- offline experiments protect releases
- online monitors protect production
You need both.
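Monitors are configured in the LangWatch UI, but the alerting logic they implement is worth understanding. A minimal sketch of windowed threshold alerting, in plain Python with illustrative parameter names, assuming nothing about LangWatch internals:

```python
def should_alert(scores: list[float], threshold: float = 0.7,
                 window: int = 20, breach_ratio: float = 0.3) -> bool:
    """Fire when too large a share of recent scores breach the threshold.

    Looking at a window instead of single scores avoids paging on one-off
    bad runs while still catching sustained degradation.
    """
    recent = scores[-window:]
    if not recent:
        return False
    breaches = sum(1 for score in recent if score < threshold)
    return breaches / len(recent) >= breach_ratio


healthy = [0.9, 0.85, 0.92, 0.88, 0.95]
degraded = [0.9, 0.4, 0.5, 0.85, 0.45]
print(should_alert(healthy))   # False
print(should_alert(degraded))  # True
```

Whatever platform you use, prefer windowed alerting like this over alerting on individual scores.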
Step 7: Add Real-Time Guardrails for Risky Paths
Guardrails are different from monitors because they can change control flow immediately.
For example, if you are exposing an AgentOS API endpoint directly to users, you might want to reject prompt injection or PII leakage before returning a result.
```python
import langwatch


@langwatch.trace(name="support_request")
def handle_support_request(user_message: str, policy_context: str) -> str:
    guardrail = langwatch.evaluation.evaluate(
        "azure/jailbreak",
        name="Jailbreak Detection",
        as_guardrail=True,
        data={"input": user_message},
    )
    if not guardrail.passed:
        return "I can't help with that request."

    decision = route_ticket(user_message, policy_context)

    langwatch.evaluation.evaluate(
        "presidio/pii_detection",
        name="PII Output Check",
        data={"output": decision.customer_reply},
    )
    return decision.customer_reply
```
Use guardrails sparingly. They are part of the request path, so they affect latency and availability.
A good rule:
If the score must block the response, keep it synchronous.
If the score is for analytics or continuous improvement, move it out of band.
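The "out of band" half of that rule can be illustrated with nothing but the standard library. This sketch uses a queue and a worker thread so the request path never waits on analytics-only scoring; in the Agno stack the idiomatic version of this is an AgentOS background post-hook, covered in the next step:

```python
import queue
import threading

# Non-blocking scores go onto a queue; a worker drains it so the
# request path never waits on analytics-only evaluation.
eval_queue: queue.Queue = queue.Queue()
scored: list[dict] = []


def eval_worker() -> None:
    while True:
        item = eval_queue.get()
        if item is None:  # sentinel: shut the worker down
            break
        # Here you would run an evaluator / send the score to LangWatch.
        scored.append({"reply": item, "length_ok": len(item) >= 40})
        eval_queue.task_done()


worker = threading.Thread(target=eval_worker, daemon=True)
worker.start()


def handle_request(reply: str) -> str:
    eval_queue.put(reply)  # fire-and-forget: scoring happens off-path
    return reply           # the user gets the response immediately


handle_request("Your refund has been escalated to a human agent for review.")
eval_queue.put(None)
worker.join()
print(scored[0]["length_ok"])  # True
```

The structure is the point: the response returns before the score exists.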
Step 8: Use AgentOS Background Hooks for Non-Blocking Evaluation
This is where Agno and AgentOS become especially useful.
The current Agno hooks docs explicitly support pre-hooks and post-hooks, and AgentOS can run hooks in the background with `@hook(run_in_background=True)`. The same docs also warn that background hooks are not the right place for guardrails because they cannot modify the request or response after the fact.
That gives you a clean production pattern:
- synchronous guardrails for things that must block
- background post-hooks for scoring, logging, alerts, and analytics
Example: background post-hook that sends custom evaluation data to LangWatch
```python
import langwatch
from agno.hooks import hook


def customer_reply_score(output: SupportDecision) -> float:
    score = 1.0
    if len(output.customer_reply) < 40:
        score -= 0.2
    if not output.cited_policy_ids:
        score -= 0.4
    if output.intent == "refund" and not output.needs_human:
        score -= 0.4
    return max(score, 0.0)


@hook(run_in_background=True)
async def evaluate_output_in_background(run_output, session=None, user_id=None):
    decision: SupportDecision = run_output.content
    with langwatch.trace(name="support-output-eval"):
        score = customer_reply_score(decision)
        langwatch.get_current_span().add_evaluation(
            name="business_rules",
            score=score,
            passed=score >= 0.8,
            details={
                "intent": decision.intent,
                "needs_human": decision.needs_human,
                "policy_ids": decision.cited_policy_ids,
                "session_id": getattr(session, "session_id", None),
                "user_id": user_id,
            },
        )
```
Attach the hook to your Agno agent:
```python
support_agent = Agent(
    name="Support Router",
    model=OpenAIResponses(id="gpt-5.2"),
    output_schema=SupportDecision,
    post_hooks=[evaluate_output_in_background],
)
```
And if you are serving through AgentOS:
```python
from agno.os import AgentOS

agent_os = AgentOS(
    name="Support AgentOS",
    agents=[support_agent],
    tracing=True,
)

app = agent_os.get_app()
```
Why this pattern works
- the user gets the response immediately
- the evaluation still runs on real production outputs
- LangWatch still receives structured evaluation data
- you can alert on low scores without slowing down the core request
For most agent products, this is the production sweet spot.
Step 9: Feed Production Failures Back into Datasets
The current LangWatch dataset docs support continuously populating datasets from traces, which is one of the most useful features in the whole system.
That creates a quality loop:
```
Production traffic
        │
        ▼
LangWatch traces
        │
        ▼
Monitors find low-scoring runs
        │
        ▼
Add those traces into a dataset
        │
        ▼
Re-run experiments before the next release
```
This is how an eval stack stops being a dashboard and becomes a learning system.
In practice, this is what I recommend:
- Start with a hand-curated golden set of 30 to 50 rows.
- Add a second dataset for real production misses.
- Promote recurring failures from production into the golden set once they are understood.
That keeps your eval suite grounded in real traffic instead of synthetic optimism.
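The promotion step is worth making deliberate rather than ad hoc. This is a hedged sketch of one promotion policy, in plain Python with illustrative field names (not a LangWatch dataset schema): a production miss joins the golden set only once it has recurred enough times and is not already covered:

```python
def promote_misses(golden: list[dict], misses: list[dict],
                   min_occurrences: int = 3) -> list[dict]:
    """Promote recurring production misses into the golden set.

    A miss is promoted once it has been seen `min_occurrences` times,
    and only if an equivalent row is not already in the golden set.
    Returns a new list; the original golden set is left untouched.
    """
    seen: dict[str, int] = {}
    for miss in misses:
        key = miss["user_message"].strip().lower()
        seen[key] = seen.get(key, 0) + 1

    existing = {row["user_message"].strip().lower() for row in golden}
    promoted = list(golden)
    added: set[str] = set()
    for miss in misses:
        key = miss["user_message"].strip().lower()
        if seen[key] >= min_occurrences and key not in existing and key not in added:
            promoted.append(miss)
            added.add(key)
    return promoted


golden = [{"user_message": "Where is my order?", "expected_intent": "shipping"}]
misses = [{"user_message": "I was billed twice", "expected_intent": "refund"}] * 3
print(len(promote_misses(golden, misses)))  # 2
```

The recurrence threshold is the "once they are understood" filter in code: one-off oddities stay in the misses dataset, repeat offenders graduate.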
Step 10: Know When to Use Agno Native Evals Too
Agno now has its own eval system for accuracy, agent-as-judge, performance, and reliability. That is not a reason to skip LangWatch. It is a reason to be deliberate.
Here is the practical split:
| Use case | Better first tool |
|---|---|
| Check whether an Agno agent uses the right tool or handles rate limits correctly | Agno reliability evals |
| Measure latency and memory for framework-level regressions | Agno performance evals |
| Score business quality across a shared dataset and compare targets in a UI | LangWatch experiments |
| Monitor live traffic and alert on low scores | LangWatch monitors |
| Run request-path guardrails | LangWatch evaluators and Agno hooks together |
My preferred production setup is:
- use Agno native evals for framework-local behavior and performance
- use LangWatch for release gating, shared datasets, production monitoring, and cross-run quality analytics
That avoids duplicating responsibilities while still taking advantage of both tools.
An Advanced Pattern: End-to-End Agent Behavior Testing
Offline row-based experiments are not enough for every agent. If your Agno system is multi-turn, tool-heavy, or workflow-based, you eventually want scenario-style testing too.
LangWatch’s Scenario testing layer is built for that higher level:
- Level 1: unit and integration tests
- Level 2: offline evals on components and outputs
- Level 3: scenario tests for full multi-turn behavior
A useful pattern is:
- use LangWatch experiments for row-level output quality
- use Scenario for full conversation behavior
- use AgentOS hooks and monitors for production drift
That gives you coverage across the whole stack instead of trying to force one test shape to do everything.
A Rollout Plan from Zero to Advanced
If you are starting from scratch, I would roll this out in this order:
- Instrument Agno with LangWatch and inspect a few traces manually.
- Add `output_schema` to the agent so the output is typed.
- Build one small golden dataset in LangWatch.
- Add one custom deterministic scorer for business correctness.
- Add one built-in evaluator for grounding, safety, or clarity.
- Compare two targets: prompt versions, models, or retrievers.
- Add one synchronous guardrail for the highest-risk path.
- Move non-critical scoring into AgentOS background post-hooks.
- Continuously add low-scoring traces back into a dataset.
- Add scenario tests if the system is multi-turn or workflow-heavy.
That path gives you useful signal quickly without building a giant evaluation platform all at once.
Common Mistakes
- Treating tracing and evaluation as the same thing. Tracing is the substrate; evaluation is the judgment layer.
- Scoring only the final reply text while ignoring structured business fields.
- Using guardrails for everything, then wondering why latency spiked.
- Pushing background hooks into control-flow decisions they cannot safely enforce.
- Building a synthetic dataset and never refreshing it with production misses.
- Comparing prompts or models without attaching metadata to targets.
Final Take
If you are using Agno and AgentOS, LangWatch is not just “another dashboard.” It is the part of the stack that lets you move from anecdotal quality to measurable quality.
The zero-to-advanced path is straightforward:
- instrument first
- type the outputs
- run batch experiments
- add online monitors
- keep blocking logic synchronous
- push non-blocking evaluation into AgentOS background hooks
- continuously recycle production failures into datasets
That combination is strong because each layer stays focused on what it is best at:
Agno builds the agent.
AgentOS runs the agent.
LangWatch tells you whether the agent is actually getting better.
Other Alternatives to LangWatch
LangWatch is a strong fit for teams that want one platform for tracing, datasets, experiments, monitors, and guardrails. It is not the only good option.
If you are evaluating platforms for an Agno or AgentOS stack, these are the alternatives I would look at first:
| Platform | Best fit | Why you might choose it |
|---|---|---|
| LangSmith | LangChain or LangGraph-heavy stacks | Strong observability and eval workflows, plus a very natural fit if your agents already live in the LangChain ecosystem |
| Langfuse | Teams that want an open-source LLM engineering platform with prompt management and evals | Good choice if you want tracing, prompts, datasets, experiments, and production scores in an OSS-friendly platform |
| Braintrust | Eval-first teams that care deeply about experiment comparison and running evals on their own infrastructure | Especially strong when evaluation is your main workflow and you want remote evals for agentic or compliance-sensitive systems |
| Arize Phoenix | Teams that want deep trace inspection plus code-driven or LLM-based evals | A strong option when you want to work close to traces and annotate or score spans directly |
| Helicone | Gateway and observability-first teams that want to centralize scores from other eval systems | Useful when you want logging, routing, and experimentation analytics, but do not necessarily want your observability tool to be your primary evaluator runtime |
A simple selection heuristic
- If your main problem is "evaluate and monitor my production agent stack": LangWatch is a strong default.
- If your main problem is "my stack is already LangChain / LangGraph": LangSmith is the easiest natural fit.
- If your main problem is "I want open source plus a broad platform": Langfuse or Phoenix are both worth serious consideration.
- If your main problem is "I want evals to be the center of the workflow": Braintrust is especially compelling.
- If your main problem is "I want gateway + observability and I can bring my own evaluators": Helicone is a practical alternative.
The important part is not picking the “best” brand. It is matching the platform to the role it will play in your system:
- runtime-adjacent evaluation
- experiment workflow
- trace debugging
- gateway analytics
- prompt and dataset management
For Agno and AgentOS users specifically, LangWatch stays attractive because it maps cleanly onto the same production concerns this article focused on: traces, offline experiments, online monitors, and guardrails. But if your organization is already standardized on one of the alternatives above, you can absolutely build a strong evaluation practice there too.
Source List
- LangWatch Evaluations Overview
- LangWatch Experiments via SDK
- LangWatch Built-in Evaluators
- LangWatch Custom Scoring
- LangWatch Online Evaluation Overview
- LangWatch Datasets Overview
- LangWatch Automatically Build Datasets from Traces
- LangWatch Structured Outputs Use Case
- Agno LangWatch Integration
- Agno Agent with Structured Output
- Agno Evals Overview
- Agno Agent as Judge Evals
- Agno Hooks Overview
- Agno AgentOS Overview
- Agno Background Hooks in AgentOS
- LangSmith Docs Overview
- Langfuse Docs Overview
- Braintrust Evaluation Quickstart
- Braintrust Observe Your Application
- Braintrust Remote Evals
- Phoenix Client-Side Evals SDK
- Phoenix Running Evals on Traces
- Helicone Quickstart
- Helicone Eval Scores