As of April 28, 2026, the official Agno docs describe Evals as a native way to measure the quality of Agno Agents and Teams across four dimensions: accuracy, agent-as-judge, performance, and reliability. The official LangWatch Scenario docs describe Scenario as a simulation-based agent testing framework that tests real agent behavior through simulated users, judges, scripts, and multi-turn conversations.
That difference matters. These tools are not duplicates. They answer different quality questions.
Agno Evals ask: did this Agno Agent or Team produce the expected result, meet a rubric, stay performant, or call the expected tools?
LangWatch Scenario asks: can this agent survive a realistic situation over multiple turns, with a user who may clarify, change direction, provide missing details, or expose an edge case?
The best production answer is usually not “pick one.” It is:
Use Agno Evals to measure agent components and framework-local behavior.
Use LangWatch Scenario to rehearse end-to-end user journeys before release.
Promote failures from either layer into repeatable tests.
TL;DR
- Agno Evals are now a first-class Agno production feature for scoring Agents and Teams.
- Agno’s documented eval dimensions are accuracy, agent-as-judge, performance, and reliability.
- LangWatch Scenario is simulation-first. It tests complete agent behavior with an agent under test, a user simulator, and a judge.
- Agno Evals are strongest when the target is a known input/output, rubric, tool-call expectation, latency profile, memory footprint, or Team behavior.
- LangWatch Scenario is strongest when the target is a user journey, ambiguity handling, clarification flow, tool recovery, or multi-turn regression.
- The clean mental model: Agno Evals are measurement probes; Scenario tests are behavioral rehearsals.
The Evidence From The Docs
Here is the source trail I used for this comparison:
| Claim | Evidence |
|---|---|
| Agno Evals measure Agents and Teams across multiple dimensions | Agno Evals overview |
| Agno’s documented dimensions are accuracy, agent-as-judge, performance, and reliability | Agno Evals overview and Agno eval examples overview |
| Accuracy evals compare actual responses against expected outputs using an evaluator model | Agno Accuracy Evals |
| Agent-as-judge evals score custom criteria such as clarity, factual accuracy, tone, or friendliness | Agno Agent as Judge Evals |
| Reliability evals validate expected tool calls for Agents and Teams | Agno Reliability Evals |
| Agno eval runs can be stored with a database and exposed through AgentOS eval routes | Agno Accuracy Evals: Track Evals in your AgentOS |
| Scenario is simulation-based and framework-agnostic | Scenario introduction |
| Scenario uses an AgentAdapter, UserSimulatorAgent, JudgeAgent, and optional script | Scenario getting started |
| Scenario integrates with Agno through the AgentAdapter interface and needs session-aware message handling | Scenario Agno integration |
| Scenario’s testing pyramid separates unit tests, evals and optimization, and simulations | Scenario Agent Testing Pyramid |
My interpretation: Agno has moved evaluation closer to the agent runtime itself. LangWatch Scenario has moved testing closer to real user behavior. That is a useful split.
What Agno Evals Are Optimized For
Agno Evals live inside the same world as the Agno Agent or Team. The examples use Agno classes such as AccuracyEval, AgentAsJudgeEval, PerformanceEval, and ReliabilityEval, and the evaluated target can be an Agent, a Team, or a captured run output.
That makes them a natural fit for framework-local quality checks:
| Evaluation Need | Agno Fit |
|---|---|
| “Did the agent answer the math question correctly?” | Accuracy eval |
| “Did the answer meet a writing rubric?” | Agent-as-judge eval |
| “Did this version get slower or use more memory?” | Performance eval |
| “Did the agent call factorial when solving 10!?” | Reliability eval |
| “Did the Team delegate to the right member/tool path?” | Reliability eval on Team run output |
| “Can I store eval results alongside Agno runtime data?” | Agno evals with database and AgentOS |
The important point is that Agno Evals are close to the agent implementation. You can score the exact object you are building.
A Minimal Agno Eval Pattern
The current Agno docs show accuracy evals with a model, an agent, an input, an expected output, and optional guidelines. In practice, that pattern looks like this:
```python
from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIResponses
from agno.tools.calculator import CalculatorTools

agent = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    tools=[CalculatorTools()],
)

evaluation = AccuracyEval(
    name="Calculator Evaluation",
    model=OpenAIResponses(id="gpt-5.2"),
    agent=agent,
    input="What is 10*5 then to the power of 2?",
    expected_output="2500",
    additional_guidelines="The answer should include the final result.",
    num_iterations=3,
)

result = evaluation.run(print_results=True)
assert result is not None and result.avg_score >= 8
```
That is a good shape when you already know the question and the expected answer.
Where Agno Evals Shine
Agno Evals are especially good for:
- scoring single-turn or bounded tasks
- comparing model choices inside an Agno Agent
- checking structured output quality
- validating that expected tools were called
- measuring latency and memory after prompt or tool changes
- storing eval runs in the same database-backed production system
- exposing evaluation history through AgentOS
If I am building an Agno support router, I want native evals for things like:
- intent classification accuracy
- escalation decision correctness
- expected tool-call checks
- response rubric scoring
- Team routing behavior
- latency budget regressions
These are specific probes. They tell me whether a known component behaves well under known pressure.
What LangWatch Scenario Is Optimized For
LangWatch Scenario sits one level higher. It is not asking only “was this answer correct?” It asks whether the full agent can handle a situation.
The docs describe a test structure with:
- an AgentAdapter that connects your real agent
- a UserSimulatorAgent that role-plays the user
- a JudgeAgent that evaluates the conversation against criteria
- an optional script that controls the turn sequence
- set_id and batch identifiers for grouping results
- caching and debug modes to make non-deterministic tests easier to repeat
- LangWatch visualization for simulation runs
That shape is much closer to end-to-end testing than to a scorecard.
A Minimal Scenario Pattern
Scenario’s getting-started docs show a pattern like this:
```python
import pytest
import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")


@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_billing_dispute_flow():
    class SupportAgent(scenario.AgentAdapter):
        async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
            # support_agent is the real Agno Agent under test, defined elsewhere.
            # Passing input.thread_id as the session id keeps the conversation stateful.
            return support_agent.run(
                input.last_new_user_message_str(),
                session_id=input.thread_id,
            ).content

    result = await scenario.run(
        name="billing dispute needs account lookup",
        description="""
            User believes they were charged twice.
            They are frustrated but cooperative.
            They want a refund or a clear explanation.
        """,
        agents=[
            SupportAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent asks for enough account information before taking action",
                "Agent explains billing policy clearly",
                "Agent does not promise a refund before verification",
                "Agent escalates if the case cannot be resolved safely",
            ]),
        ],
        max_turns=8,
    )

    assert result.success
```
This test can fail even when the final answer looks good. For example, it can catch that the agent promised a refund too early, skipped verification, or kept asking redundant questions.
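When you need tighter control than a fully free-form simulation, the optional script can pin specific turns. The sketch below assumes the SupportAgent adapter from the test above is moved to module scope; the script helpers and the has_tool_call check follow the scripted-simulation pattern described in the Scenario docs, and get_order_status is a hypothetical tool name from this article, so verify helper signatures against the current Scenario API:

```python
@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_billing_dispute_scripted():
    def order_was_looked_up(state: scenario.ScenarioState) -> None:
        # Hypothetical assertion: the agent must have called the (example) lookup tool
        # before the conversation is allowed to continue.
        assert state.has_tool_call("get_order_status")

    result = await scenario.run(
        name="billing dispute with a fixed opening",
        description="User believes they were charged twice for order A123.",
        agents=[
            SupportAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent verifies the order before discussing refunds",
            ]),
        ],
        script=[
            scenario.user("You charged me twice for order A123."),  # fixed first turn
            scenario.agent(),                                       # let the agent respond
            order_was_looked_up,                                    # custom assertion
            scenario.proceed(turns=2),                              # free-form continuation
            scenario.judge(),                                       # force a verdict
        ],
    )

    assert result.success
```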
Where Scenario Shines
Scenario is strongest for:
- multi-turn agent behavior
- ambiguous user requests
- clarification and confirmation flows
- realistic edge cases
- recovery after user correction
- degraded operation and missing configuration
- end-to-end business capability checks
- safety and adversarial behavior tests
- CI tests that express user journeys instead of expected strings
If Agno Evals are closer to “measure this response,” Scenario is closer to “act out this use case and see whether the agent can finish the job.”
Direct Comparison
| Dimension | Agno Evals | LangWatch Scenario |
|---|---|---|
| Primary purpose | Measure Agno Agent or Team quality | Test full agent behavior through simulations |
| Best unit of testing | Input/output, run output, tool call, Agent, Team | User journey, conversation, scenario, business task |
| Framework coupling | Agno-native | Framework-agnostic, with Agno adapter support |
| Main abstraction | Eval classes such as AccuracyEval and ReliabilityEval | scenario.run() with AgentAdapter, simulator, judge, script |
| Multi-turn behavior | Possible through the agent runtime, but not the central abstraction | Core design goal |
| Expected output checks | Strong | Possible, but usually expressed as criteria or custom assertions |
| Tool-call checks | Strong through reliability evals | Strong through scripts, custom assertions, and judge criteria |
| Performance measurement | Native dimension | Not the primary focus |
| Production runtime fit | Natural with AgentOS and database-backed eval runs | Natural in CI and simulation visualization |
| Stakeholder readability | Good for metrics and score history | Very good for “can the agent do this real thing?” |
| Failure output | Scores, pass/fail, summaries, stored eval runs | Full conversation, judge reasoning, pass/fail, simulation visualization |
The short version:
Agno Evals are better for measuring known properties.
Scenario is better for discovering behavioral failures.
The Testing Pyramid Lens
LangWatch’s Scenario docs describe an agent testing pyramid with three layers:
- Unit tests for deterministic software pieces.
- Evals and optimization for probabilistic components.
- Simulations for complete agent behavior.
Agno Evals fit mostly in the middle layer. They measure probabilistic components and agent outputs. Agno reliability evals also touch the boundary between unit/integration checks and behavior checks because tool-call expectations are concrete.
Scenario fits at the top layer. It validates whether the integrated system works from the user’s point of view.
That gives us a practical architecture:
AGENT QUALITY STACK

- Unit tests
  - tool functions
  - database queries
  - policy parsers
  - schema validation
- Agno Evals
  - answer accuracy
  - rubric scoring
  - expected tool calls
  - Agent and Team performance
- LangWatch Scenario
  - multi-turn user journeys
  - ambiguous requests
  - edge cases
  - release-blocking behavior tests
- Production monitoring
  - traces
  - online evals
  - alerts
  - failed traces promoted into new tests
No single layer does the whole job. That is not a weakness. It is what a mature test strategy looks like.
A Real Example: Refund Support Agent
Imagine an Agno agent that handles refund requests. It has tools for:
- looking up an order
- checking refund eligibility
- opening a ticket
- escalating to a human
- sending a customer-facing reply
The product requirement is:
The agent should help eligible customers get refunds,
but it must not promise money back before checking policy,
and it must escalate unclear or high-risk cases.
What I Would Test With Agno Evals
Use Agno Accuracy Evals for known cases (a looped sketch follows the table):
| Input | Expected |
|---|---|
| “I was charged twice for order A123” | classify as billing/refund |
| “My package arrived damaged” | classify as damaged item |
| “Can I get a refund after 90 days?” | explain policy or escalate |
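Those cases can be written directly against the AccuracyEval pattern shown earlier. This is a minimal sketch rather than a complete suite: refund_agent, the case list, and the score threshold are placeholders for your own setup.

```python
from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIResponses

# Hypothetical golden cases for the refund router; replace with your own dataset.
GOLDEN_CASES = [
    ("I was charged twice for order A123", "classified as billing/refund"),
    ("My package arrived damaged", "classified as damaged item"),
    ("Can I get a refund after 90 days?", "explain policy or escalate"),
]

# Stand-in for the real refund routing agent under test.
refund_agent = Agent(model=OpenAIResponses(id="gpt-5.2"))

for text, expected in GOLDEN_CASES:
    evaluation = AccuracyEval(
        name=f"Refund routing: {text[:30]}",
        model=OpenAIResponses(id="gpt-5.2"),
        agent=refund_agent,
        input=text,
        expected_output=expected,
        num_iterations=1,
    )
    result = evaluation.run(print_results=True)
    assert result is not None and result.avg_score >= 8
```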
Use Agent-as-Judge Evals for response quality:
- clear explanation
- no unsupported promises
- polite tone
- policy-grounded language
- no invented refund terms
Use Reliability Evals for expected tool calls (a sketch follows the table):
| Scenario | Expected Tool Calls |
|---|---|
| user gives order number | get_order_status, check_refund_policy |
| unclear refund request | clarification before tool action |
| high-risk billing dispute | escalate_to_human |
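The first row of that table maps to Agno's ReliabilityEval, which validates expected tool calls against a captured run. This is a sketch under assumptions: the two tools are the hypothetical refund tools from this example, and the constructor arguments shown here should be confirmed against the Agno Reliability Evals docs.

```python
from agno.agent import Agent
from agno.eval.reliability import ReliabilityEval
from agno.models.openai import OpenAIResponses


def get_order_status(order_id: str) -> str:
    """Hypothetical tool: look up an order."""
    return f"Order {order_id}: delivered 12 days ago."


def check_refund_policy(order_id: str) -> str:
    """Hypothetical tool: check refund eligibility."""
    return f"Order {order_id} is inside the 30-day refund window."


# Stand-in for the refund agent under test, with the example tools attached.
refund_agent = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    tools=[get_order_status, check_refund_policy],
)

# Capture a real run, then assert the expected tool calls were made.
response = refund_agent.run("I was charged twice for order A123. Can I get a refund?")

evaluation = ReliabilityEval(
    name="Order lookup before policy check",
    agent_response=response,
    expected_tool_calls=["get_order_status", "check_refund_policy"],
)

result = evaluation.run(print_results=True)
result.assert_passed()
```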
Use Performance Evals for release safety (a sketch follows the list):
- p95 response time
- memory impact
- model or prompt comparison
- Team routing overhead
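A latency guardrail can follow the PerformanceEval pattern from the Agno docs, which times a callable over several iterations. This sketch assumes a stand-in refund agent and a fixed representative request; the iteration count is a placeholder, and the exact PerformanceEval parameters should be checked against the Agno Performance Evals docs.

```python
from agno.agent import Agent
from agno.eval.performance import PerformanceEval
from agno.models.openai import OpenAIResponses

# Stand-in for the refund agent; swap in your real configuration.
refund_agent = Agent(model=OpenAIResponses(id="gpt-5.2"))


def run_refund_query():
    # A fixed, representative request so runs stay comparable across versions.
    return refund_agent.run("I was charged twice for order A123. Can I get a refund?")


performance_eval = PerformanceEval(
    name="Refund agent latency guardrail",
    func=run_refund_query,
    num_iterations=10,
)

performance_eval.run(print_results=True)
```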
These are targeted measurements. They are compact, repeatable, and close to the Agno implementation.
What I Would Test With Scenario
Use Scenario for full journeys:
| Scenario | What It Proves |
|---|---|
| Frustrated customer says “you charged me twice” but does not provide order details | Agent asks for needed account/order info without making promises |
| Customer changes from refund request to replacement request mid-conversation | Agent adapts instead of forcing the original path |
| Tool says the order is outside refund window | Agent explains policy and offers escalation without hallucinating exceptions |
| User pushes for a refund before verification | Agent maintains policy boundary |
| User provides partial information across multiple turns | Agent tracks context and avoids asking the same thing repeatedly |
These are behavior rehearsals. They tell you whether the agent can actually carry the conversation.
How To Combine Them In CI
A practical workflow looks like this:
```text
Pull request opens
        |
        v
Run unit tests
        |
        v
Run fast Agno Evals
  - accuracy smoke set
  - key reliability checks
  - one performance guardrail
        |
        v
Run Scenario smoke simulations
  - top happy path
  - top ambiguity path
  - top safety boundary path
        |
        v
Block merge only on high-signal failures
```
For nightly or pre-release runs:
```text
Nightly job
        |
        v
Run full Agno eval suite
        |
        v
Run broader Scenario pack
        |
        v
Export failed traces and failed conversations
        |
        v
Promote new cases into eval datasets or scenario tests
```
The trick is to keep pull request checks fast. Run enough to catch obvious regressions, then let heavier simulation packs run on a schedule.
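One lightweight way to keep that split explicit, assuming the pytest setup from the Scenario example above, is to register a hypothetical nightly marker and filter on it, so pull requests run only the smoke subset:

```python
# conftest.py (sketch): register a hypothetical "nightly" marker for heavier packs.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "nightly: heavier eval and simulation packs, run on a schedule"
    )
```

Pull request jobs can then run pytest -m "agent_test and not nightly", while the scheduled job drops the filter and runs the full Agno eval suite and the broader Scenario pack.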
The Decision Guide
Use Agno Evals when:
- you are already building in Agno
- the behavior can be expressed as input plus expected output
- you need model, prompt, or tool comparison inside the Agno runtime
- you need tool-call reliability checks
- you need performance and memory measurement
- you want eval runs stored with Agno database infrastructure or surfaced through AgentOS
Use LangWatch Scenario when:
- the failure happens across multiple turns
- the user journey matters more than one answer
- the agent must ask clarifying questions
- the agent must recover from confusion, tool failure, or user correction
- product stakeholders need to see “can it do the job?”
- you want CI tests that read like business scenarios
- you need simulation results in LangWatch for review and debugging
Use both when:
- the agent touches money, support, operations, legal, healthcare, security, deployment, or customer trust
- prompt changes are frequent
- model upgrades are frequent
- you need both scores and behavioral evidence
- you want failures to become durable regression tests
Common Mistakes
Mistake 1: Treating Evals As End-to-End Tests
An eval can say an answer is accurate. It does not automatically prove the agent can handle a messy conversation.
For example, an Agno eval might confirm that the agent answers “refunds are available within 30 days” correctly. Scenario might reveal that the agent fails when the user says, “I bought it a while ago, I think maybe last month, but I changed cards since then.”
That is not the same test.
Mistake 2: Treating Scenario As A Metrics System
Scenario can produce pass/fail results and judge reasoning, but it is not primarily a latency benchmark or component optimization harness.
If you want to compare model A vs model B across 200 classification cases, native evals are usually the cleaner place to start.
Mistake 3: Only Testing Happy Paths
The first Scenario pack should include at least:
- one normal completion path
- one ambiguity path
- one safety boundary path
- one tool failure or missing data path
- one user correction path
The first Agno eval suite should include at least:
- known-good expected outputs
- known-bad edge cases
- expected tool-call checks
- a rubric for free-form responses
- a simple performance budget
Mistake 4: Letting Failed Cases Disappear
When an Agno eval fails, turn the input into a durable case.
When a Scenario test fails, preserve the conversation pattern. Was the user ambiguous? Did the agent skip verification? Did it lose context after a tool call? That pattern is the real asset.
The Practical Recommendation
For an Agno production agent, I would start like this:
- Write unit tests for deterministic tools and data transforms.
- Add Agno Accuracy Evals for 10 to 20 golden cases.
- Add Agno Reliability Evals for the top tool-call expectations.
- Add one Agent-as-Judge Eval for response quality.
- Add one small Performance Eval so latency regressions are visible.
- Add 3 to 5 LangWatch Scenario tests for the highest-value user journeys.
- Run the small set in CI and the larger set nightly.
- Promote production failures into Agno eval cases or Scenario simulations depending on their shape.
The placement rule is simple:
If the failure is about a measurable response property, put it in Agno Evals.
If the failure is about a conversation path, put it in Scenario.
Final Take
Agno Evals and LangWatch Scenario are complementary tools in the same quality system.
Agno Evals give you framework-native measurement: accuracy, rubric quality, performance, reliability, Agent behavior, Team behavior, and AgentOS visibility. They are the probes you attach to the machine.
LangWatch Scenario gives you simulation-based behavioral testing: realistic users, judges, scripts, multi-turn flows, and agent-framework adapters. It is the rehearsal before the machine meets the public.
If you are shipping agents that matter, you want both:
Agno Evals for measurable quality.
LangWatch Scenario for lived behavior.
Production traces to keep both honest.
That is the difference between asking, “Did the model answer this example?” and asking, “Can the agent actually do the job?”