AI Quality & Evaluation

Agno Evals vs LangWatch Scenario: Native Agent Metrics or Simulated Agent Tests?

A source-backed comparison of Agno's latest native eval docs and LangWatch Scenario's simulation-based testing model, with practical guidance on when to use each and how to combine them.

15 min read · Updated Apr 28, 2026

As of April 28, 2026, the official Agno docs describe Evals as a native way to measure the quality of Agno Agents and Teams across four dimensions: accuracy, agent-as-judge, performance, and reliability. The official LangWatch Scenario docs describe Scenario as a simulation-based agent testing framework that tests real agent behavior through simulated users, judges, scripts, and multi-turn conversations.

That difference matters. These tools are not duplicates. They answer different quality questions.

Agno Evals ask: did this Agno Agent or Team produce the expected result, meet a rubric, stay performant, or call the expected tools?

LangWatch Scenario asks: can this agent survive a realistic situation over multiple turns, with a user who may clarify, change direction, provide missing details, or expose an edge case?

The best production answer is usually not “pick one.” It is:

Use Agno Evals to measure agent components and framework-local behavior.
Use LangWatch Scenario to rehearse end-to-end user journeys before release.
Promote failures from either layer into repeatable tests.

TL;DR

  • Agno Evals are now a first-class Agno production feature for scoring Agents and Teams.
  • Agno’s documented eval dimensions are accuracy, agent-as-judge, performance, and reliability.
  • LangWatch Scenario is simulation-first. It tests complete agent behavior with an agent under test, a user simulator, and a judge.
  • Agno Evals are strongest when the target is a known input/output, rubric, tool-call expectation, latency profile, memory footprint, or Team behavior.
  • LangWatch Scenario is strongest when the target is a user journey, ambiguity handling, clarification flow, tool recovery, or multi-turn regression.
  • The clean mental model: Agno Evals are measurement probes; Scenario tests are behavioral rehearsals.

The Evidence From The Docs

Here is the source trail I used for this comparison:

| Claim | Evidence |
| --- | --- |
| Agno Evals measure Agents and Teams across multiple dimensions | Agno Evals overview |
| Agno’s documented dimensions are accuracy, agent-as-judge, performance, and reliability | Agno Evals overview and Agno eval examples overview |
| Accuracy evals compare actual responses against expected outputs using an evaluator model | Agno Accuracy Evals |
| Agent-as-judge evals score custom criteria such as clarity, factual accuracy, tone, or friendliness | Agno Agent as Judge Evals |
| Reliability evals validate expected tool calls for Agents and Teams | Agno Reliability Evals |
| Agno eval runs can be stored with a database and exposed through AgentOS eval routes | Agno Accuracy Evals: Track Evals in your AgentOS |
| Scenario is simulation-based and framework-agnostic | Scenario introduction |
| Scenario uses an AgentAdapter, UserSimulatorAgent, JudgeAgent, and optional script | Scenario getting started |
| Scenario integrates with Agno through the AgentAdapter interface and needs session-aware message handling | Scenario Agno integration |
| Scenario’s testing pyramid separates unit tests, evals and optimization, and simulations | Scenario Agent Testing Pyramid |

My interpretation: Agno has moved evaluation closer to the agent runtime itself. LangWatch Scenario has moved testing closer to real user behavior. That is a useful split.

What Agno Evals Are Optimized For

Agno Evals live inside the same world as the Agno Agent or Team. The examples use Agno classes such as AccuracyEval, AgentAsJudgeEval, PerformanceEval, and ReliabilityEval, and the evaluated target can be an Agent, a Team, or a captured run output.

That makes them a natural fit for framework-local quality checks:

| Evaluation Need | Agno Fit |
| --- | --- |
| “Did the agent answer the math question correctly?” | Accuracy eval |
| “Did the answer meet a writing rubric?” | Agent-as-judge eval |
| “Did this version get slower or use more memory?” | Performance eval |
| “Did the agent call factorial when solving 10!?” | Reliability eval |
| “Did the Team delegate to the right member/tool path?” | Reliability eval on Team run output |
| “Can I store eval results alongside Agno runtime data?” | Agno evals with database and AgentOS |

The important point is that Agno Evals are close to the agent implementation. You can score the exact object you are building.

A Minimal Agno Eval Pattern

The current Agno docs show accuracy evals with a model, an agent, an input, an expected output, and optional guidelines. In practice, that pattern looks like this:

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIResponses
from agno.tools.calculator import CalculatorTools

agent = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    tools=[CalculatorTools()],
)

evaluation = AccuracyEval(
    name="Calculator Evaluation",
    model=OpenAIResponses(id="gpt-5.2"),
    agent=agent,
    input="What is 10*5 then to the power of 2?",
    expected_output="2500",
    additional_guidelines="The answer should include the final result.",
    num_iterations=3,
)

result = evaluation.run(print_results=True)
assert result is not None and result.avg_score >= 8

That is a good shape when you already know the question and the expected answer.
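
Reliability evals follow a similar framework-local shape. The sketch below is my reading of the Agno Reliability Evals docs: capture a run, then check which tools it actually called. Treat the exact field names as assumptions if your Agno version differs.

from agno.agent import Agent
from agno.eval.reliability import ReliabilityEval
from agno.models.openai import OpenAIResponses
from agno.tools.calculator import CalculatorTools

agent = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    tools=[CalculatorTools()],
)

# Capture a real run first; the eval inspects the tool calls recorded on it.
response = agent.run("What is 10!?")

evaluation = ReliabilityEval(
    name="Factorial Tool Call",
    agent_response=response,
    expected_tool_calls=["factorial"],
)

result = evaluation.run(print_results=True)
assert result is not None
result.assert_passed()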

Where Agno Evals Shine

Agno Evals are especially good for:

  • scoring single-turn or bounded tasks
  • comparing model choices inside an Agno Agent
  • checking structured output quality
  • validating that expected tools were called
  • measuring latency and memory after prompt or tool changes
  • storing eval runs in the same database-backed production system
  • exposing evaluation history through AgentOS

If I am building an Agno support router, I want native evals for things like:

  • intent classification accuracy
  • escalation decision correctness
  • expected tool-call checks
  • response rubric scoring
  • Team routing behavior
  • latency budget regressions

These are specific probes. They tell me whether a known component behaves well under known pressure.
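
To make the first two concrete, a classification probe can reuse the AccuracyEval shape from the calculator example. The support_router agent and the label set here are hypothetical stand-ins for your own:

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIResponses

# Hypothetical router agent that replies with a single intent label.
support_router = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    instructions="Classify the message as one of: billing_refund, damaged_item, other. Reply with the label only.",
)

intent_eval = AccuracyEval(
    name="Intent classification: duplicate charge",
    model=OpenAIResponses(id="gpt-5.2"),
    agent=support_router,
    input="I was charged twice for order A123",
    expected_output="billing_refund",
    additional_guidelines="The response should contain only the intent label.",
    num_iterations=5,
)

result = intent_eval.run(print_results=True)
assert result is not None and result.avg_score >= 8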

What LangWatch Scenario Is Optimized For

LangWatch Scenario sits one level higher. It does not just ask “was this answer correct?”; it asks whether the full agent can handle a situation.

The docs describe a test structure with:

  • an AgentAdapter that connects your real agent
  • a UserSimulatorAgent that role-plays the user
  • a JudgeAgent that evaluates the conversation against criteria
  • an optional script that controls the turn sequence
  • set_id and batch identifiers for grouping results
  • caching and debug modes to make non-deterministic tests easier to repeat
  • LangWatch visualization for simulation runs

That shape is much closer to end-to-end testing than to a scorecard.

A Minimal Scenario Pattern

Scenario’s getting-started docs show a pattern like this:

import pytest
import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")

@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_billing_dispute_flow():
    # support_agent is your real Agno agent, defined elsewhere in the test module.
    class SupportAgent(scenario.AgentAdapter):
        async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
            # Pass Scenario's thread_id through as the Agno session_id so the
            # agent stays session-aware across simulated turns.
            return support_agent.run(
                input.last_new_user_message_str(),
                session_id=input.thread_id,
            ).content

    result = await scenario.run(
        name="billing dispute needs account lookup",
        description="""
            User believes they were charged twice.
            They are frustrated but cooperative.
            They want a refund or a clear explanation.
        """,
        agents=[
            SupportAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent asks for enough account information before taking action",
                "Agent explains billing policy clearly",
                "Agent does not promise a refund before verification",
                "Agent escalates if the case cannot be resolved safely",
            ]),
        ],
        max_turns=8,
    )

    assert result.success

This test can fail even when the final answer looks good. For example, it can catch that the agent promised a refund too early, skipped verification, or kept asking redundant questions.
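
Scenario’s optional script gives you finer control over the turn sequence when you want to pin a specific pressure point. The sketch below assumes the scripting helpers described in the Scenario docs (scenario.user, scenario.agent, scenario.judge) and assumes the SupportAgent adapter above is lifted to module scope:

@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_refund_pressure_is_resisted():
    result = await scenario.run(
        name="refund pressure before verification",
        description="User demands an immediate refund for a duplicate charge.",
        agents=[
            SupportAgent(),  # same adapter as in the test above
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent verifies the account and order before promising any refund",
            ]),
        ],
        script=[
            scenario.user("Refund me right now, you charged me twice."),
            scenario.agent(),   # agent under test replies
            scenario.user(),    # simulator improvises the follow-up turn
            scenario.agent(),
            scenario.judge(),   # force a verdict at this point in the conversation
        ],
    )

    assert result.success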

Where Scenario Shines

Scenario is strongest for:

  • multi-turn agent behavior
  • ambiguous user requests
  • clarification and confirmation flows
  • realistic edge cases
  • recovery after user correction
  • degraded operation and missing configuration
  • end-to-end business capability checks
  • safety and adversarial behavior tests
  • CI tests that express user journeys instead of expected strings

If Agno Evals are closer to “measure this response,” Scenario is closer to “act out this use case and see whether the agent can finish the job.”

Direct Comparison

| Dimension | Agno Evals | LangWatch Scenario |
| --- | --- | --- |
| Primary purpose | Measure Agno Agent or Team quality | Test full agent behavior through simulations |
| Best unit of testing | Input/output, run output, tool call, Agent, Team | User journey, conversation, scenario, business task |
| Framework coupling | Agno-native | Framework-agnostic, with Agno adapter support |
| Main abstraction | Eval classes such as AccuracyEval and ReliabilityEval | scenario.run() with AgentAdapter, simulator, judge, script |
| Multi-turn behavior | Possible through the agent runtime, but not the central abstraction | Core design goal |
| Expected output checks | Strong | Possible, but usually expressed as criteria or custom assertions |
| Tool-call checks | Strong through reliability evals | Strong through scripts, custom assertions, and judge criteria |
| Performance measurement | Native dimension | Not the primary focus |
| Production runtime fit | Natural with AgentOS and database-backed eval runs | Natural in CI and simulation visualization |
| Stakeholder readability | Good for metrics and score history | Very good for “can the agent do this real thing?” |
| Failure output | Scores, pass/fail, summaries, stored eval runs | Full conversation, judge reasoning, pass/fail, simulation visualization |

The short version:

Agno Evals are better for measuring known properties.
Scenario is better for discovering behavioral failures.

The Testing Pyramid Lens

LangWatch’s Scenario docs describe an agent testing pyramid with three layers:

  1. Unit tests for deterministic software pieces.
  2. Evals and optimization for probabilistic components.
  3. Simulations for complete agent behavior.

Agno Evals fit mostly in the middle layer. They measure probabilistic components and agent outputs. Agno reliability evals also touch the boundary between unit/integration checks and behavior checks because tool-call expectations are concrete.

Scenario fits at the top layer. It validates whether the integrated system works from the user’s point of view.

That gives us a practical architecture:

                 AGENT QUALITY STACK

  Unit tests
  - tool functions
  - database queries
  - policy parsers
  - schema validation

  Agno Evals
  - answer accuracy
  - rubric scoring
  - expected tool calls
  - Agent and Team performance

  LangWatch Scenario
  - multi-turn user journeys
  - ambiguous requests
  - edge cases
  - release-blocking behavior tests

  Production monitoring
  - traces
  - online evals
  - alerts
  - failed traces promoted into new tests

No single layer does the whole job. That is not a weakness. It is what mature test strategy looks like.

A Real Example: Refund Support Agent

Imagine an Agno agent that handles refund requests. It has tools for:

  • looking up an order
  • checking refund eligibility
  • opening a ticket
  • escalating to a human
  • sending a customer-facing reply

The product requirement is:

The agent should help eligible customers get refunds,
but it must not promise money back before checking policy,
and it must escalate unclear or high-risk cases.

What I Would Test With Agno Evals

Use Agno Accuracy Evals for known cases:

| Input | Expected |
| --- | --- |
| “I was charged twice for order A123” | classify as billing/refund |
| “My package arrived damaged” | classify as damaged item |
| “Can I get a refund after 90 days?” | explain policy or escalate |

Use Agent-as-Judge Evals for response quality:

  • clear explanation
  • no unsupported promises
  • polite tone
  • policy-grounded language
  • no invented refund terms

Use Reliability Evals for expected tools:

| Scenario | Expected Tool Calls |
| --- | --- |
| user gives order number | get_order_status, check_refund_policy |
| unclear refund request | clarification before tool action |
| high-risk billing dispute | escalate_to_human |

Use Performance Evals for release safety:

  • p95 response time
  • memory impact
  • model or prompt comparison
  • Team routing overhead

These are targeted measurements. They are compact, repeatable, and close to the Agno implementation.
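
For the performance items, a minimal guardrail might look like the sketch below. It assumes the PerformanceEval interface from the Agno performance docs (a func to measure plus num_iterations); the refund_agent is a hypothetical stand-in for your real Agent or Team.

from agno.agent import Agent
from agno.eval.performance import PerformanceEval
from agno.models.openai import OpenAIResponses

# Hypothetical refund agent; swap in your real Agent or Team.
refund_agent = Agent(model=OpenAIResponses(id="gpt-5.2"))

def run_refund_request():
    # One representative agent run; the eval times and measures this call.
    return refund_agent.run("I was charged twice for order A123").content

performance_eval = PerformanceEval(
    name="Refund agent latency",
    func=run_refund_request,
    num_iterations=5,
)

# Prints runtime and memory statistics across the iterations.
performance_eval.run(print_results=True)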

What I Would Test With Scenario

Use Scenario for full journeys:

| Scenario | What It Proves |
| --- | --- |
| Frustrated customer says “you charged me twice” but does not provide order details | Agent asks for needed account/order info without making promises |
| Customer changes from refund request to replacement request mid-conversation | Agent adapts instead of forcing the original path |
| Tool says the order is outside refund window | Agent explains policy and offers escalation without hallucinating exceptions |
| User pushes for a refund before verification | Agent maintains policy boundary |
| User provides partial information across multiple turns | Agent tracks context and avoids asking the same thing repeatedly |

These are behavior rehearsals. They tell you whether the agent can actually carry the conversation.

How To Combine Them In CI

A practical workflow looks like this:

Pull request opens
    |
    v
Run unit tests
    |
    v
Run fast Agno Evals
  - accuracy smoke set
  - key reliability checks
  - one performance guardrail
    |
    v
Run Scenario smoke simulations
  - top happy path
  - top ambiguity path
  - top safety boundary path
    |
    v
Block merge only on high-signal failures

For nightly or pre-release runs:

Nightly job
    |
    v
Run full Agno eval suite
    |
    v
Run broader Scenario pack
    |
    v
Export failed traces and failed conversations
    |
    v
Promote new cases into eval datasets or scenario tests

The trick is to keep pull request checks fast. Run enough to catch obvious regressions, then let heavier simulation packs run on a schedule.
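
One lightweight way to enforce that split, assuming you keep the pytest markers from the Scenario example and add a smoke marker of your own, is to register them in conftest.py and filter per pipeline stage:

# conftest.py
def pytest_configure(config):
    # Register the markers so pytest does not warn about unknown marks.
    config.addinivalue_line("markers", "agent_test: agent simulation tests")
    config.addinivalue_line("markers", "smoke: fast, release-blocking checks run on every pull request")

A pull request job can then run pytest -m "agent_test and smoke", while the nightly job drops the smoke filter and runs the full pack.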

The Decision Guide

Use Agno Evals when:

  • you are already building in Agno
  • the behavior can be expressed as input plus expected output
  • you need model, prompt, or tool comparison inside the Agno runtime
  • you need tool-call reliability checks
  • you need performance and memory measurement
  • you want eval runs stored with Agno database infrastructure or surfaced through AgentOS

Use LangWatch Scenario when:

  • the failure happens across multiple turns
  • the user journey matters more than one answer
  • the agent must ask clarifying questions
  • the agent must recover from confusion, tool failure, or user correction
  • product stakeholders need to see “can it do the job?”
  • you want CI tests that read like business scenarios
  • you need simulation results in LangWatch for review and debugging

Use both when:

  • the agent touches money, support, operations, legal, healthcare, security, deployment, or customer trust
  • prompt changes are frequent
  • model upgrades are frequent
  • you need both scores and behavioral evidence
  • you want failures to become durable regression tests

Common Mistakes

Mistake 1: Treating Evals As End-to-End Tests

An eval can say an answer is accurate. It does not automatically prove the agent can handle a messy conversation.

For example, an Agno eval might confirm that the agent answers “refunds are available within 30 days” correctly. Scenario might reveal that the agent fails when the user says, “I bought it a while ago, I think maybe last month, but I changed cards since then.”

That is not the same test.

Mistake 2: Treating Scenario As A Metrics System

Scenario can produce pass/fail results and judge reasoning, but it is not primarily a latency benchmark or component optimization harness.

If you want to compare model A vs model B across 200 classification cases, native evals are usually the cleaner place to start.

Mistake 3: Only Testing Happy Paths

The first Scenario pack should include at least:

  • one normal completion path
  • one ambiguity path
  • one safety boundary path
  • one tool failure or missing data path
  • one user correction path

The first Agno eval suite should include at least:

  • known-good expected outputs
  • known-bad edge cases
  • expected tool-call checks
  • a rubric for free-form responses
  • a simple performance budget

Mistake 4: Letting Failed Cases Disappear

When an Agno eval fails, turn the input into a durable case.

When a Scenario test fails, preserve the conversation pattern. Was the user ambiguous? Did the agent skip verification? Did it lose context after a tool call? That pattern is the real asset.
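
One low-ceremony way to keep those cases, assuming the AccuracyEval interface shown earlier, is to append each failure to a small JSONL dataset and replay it as a regression suite. The file path, record format, and refund_agent here are illustrative:

import json
from pathlib import Path

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIResponses

GOLDEN_CASES = Path("evals/refund_golden_cases.jsonl")  # illustrative path
refund_agent = Agent(model=OpenAIResponses(id="gpt-5.2"))  # stand-in for your real agent

def load_cases():
    # One JSON object per line: {"input": ..., "expected_output": ..., "note": ...}
    return [json.loads(line) for line in GOLDEN_CASES.read_text().splitlines() if line.strip()]

def test_promoted_failures():
    for case in load_cases():
        evaluation = AccuracyEval(
            name=f"Promoted case: {case['note']}",
            model=OpenAIResponses(id="gpt-5.2"),
            agent=refund_agent,
            input=case["input"],
            expected_output=case["expected_output"],
        )
        result = evaluation.run(print_results=False)
        assert result is not None and result.avg_score >= 8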

The Practical Recommendation

For an Agno production agent, I would start like this:

  1. Write unit tests for deterministic tools and data transforms.
  2. Add Agno Accuracy Evals for 10 to 20 golden cases.
  3. Add Agno Reliability Evals for the top tool-call expectations.
  4. Add one Agent-as-Judge Eval for response quality.
  5. Add one small Performance Eval so latency regressions are visible.
  6. Add 3 to 5 LangWatch Scenario tests for the highest-value user journeys.
  7. Run the small set in CI and the larger set nightly.
  8. Promote production failures into Agno eval cases or Scenario simulations depending on their shape.

The placement rule is simple:

If the failure is about a measurable response property, put it in Agno Evals.
If the failure is about a conversation path, put it in Scenario.

Final Take

Agno Evals and LangWatch Scenario are complementary tools in the same quality system.

Agno Evals give you framework-native measurement: accuracy, rubric quality, performance, reliability, Agent behavior, Team behavior, and AgentOS visibility. They are the probes you attach to the machine.

LangWatch Scenario gives you simulation-based behavioral testing: realistic users, judges, scripts, multi-turn flows, and agent-framework adapters. It is the rehearsal before the machine meets the public.

If you are shipping agents that matter, you want both:

Agno Evals for measurable quality.
LangWatch Scenario for lived behavior.
Production traces to keep both honest.

That is the difference between asking, “Did the model answer this example?” and asking, “Can the agent actually do the job?”