AI Quality & Evaluation

Agno Evals vs LangWatch Scenario: Native Agent Metrics or Simulated Agent Tests?

A source-backed comparison of Agno's latest native eval docs and LangWatch Scenario's simulation-based testing model, with practical guidance on when to use each and how to combine them.

15 min read · Updated Apr 28, 2026

As of April 28, 2026, the official Agno docs describe Evals as a native way to measure the quality of Agno Agents and Teams across four dimensions: accuracy, agent-as-judge, performance, and reliability. The official LangWatch Scenario docs describe Scenario as a simulation-based agent testing framework that tests real agent behavior through simulated users, judges, scripts, and multi-turn conversations.

That difference matters. These tools are not duplicates. They answer different quality questions.

Agno Evals ask: did this Agno Agent or Team produce the expected result, meet a rubric, stay performant, or call the expected tools?

LangWatch Scenario asks: can this agent survive a realistic situation over multiple turns, with a user who may clarify, change direction, provide missing details, or expose an edge case?

The best production answer is usually not “pick one.” It is:

Use Agno Evals to measure agent components and framework-local behavior.
Use LangWatch Scenario to rehearse end-to-end user journeys before release.
Promote failures from either layer into repeatable tests.

TL;DR

  • Agno Evals are now a first-class Agno production feature for scoring Agents and Teams.
  • Agno’s documented eval dimensions are accuracy, agent-as-judge, performance, and reliability.
  • LangWatch Scenario is simulation-first. It tests complete agent behavior with an agent under test, a user simulator, and a judge.
  • Agno Evals are strongest when the target is a known input/output, rubric, tool-call expectation, latency profile, memory footprint, or Team behavior.
  • LangWatch Scenario is strongest when the target is a user journey, ambiguity handling, clarification flow, tool recovery, or multi-turn regression.
  • The clean mental model: Agno Evals are measurement probes; Scenario tests are behavioral rehearsals.

The Evidence From The Docs

Here is the source trail I used for this comparison:

| Claim | Evidence |
| --- | --- |
| Agno Evals measure Agents and Teams across multiple dimensions | Agno Evals overview |
| Agno’s documented dimensions are accuracy, agent-as-judge, performance, and reliability | Agno Evals overview and Agno eval examples overview |
| Accuracy evals compare actual responses against expected outputs using an evaluator model | Agno Accuracy Evals |
| Agent-as-judge evals score custom criteria such as clarity, factual accuracy, tone, or friendliness | Agno Agent as Judge Evals |
| Reliability evals validate expected tool calls for Agents and Teams | Agno Reliability Evals |
| Agno eval runs can be stored with a database and exposed through AgentOS eval routes | Agno Accuracy Evals: Track Evals in your AgentOS |
| Scenario is simulation-based and framework-agnostic | Scenario introduction |
| Scenario uses an AgentAdapter, UserSimulatorAgent, JudgeAgent, and optional script | Scenario getting started |
| Scenario integrates with Agno through the AgentAdapter interface and needs session-aware message handling | Scenario Agno integration |
| Scenario’s testing pyramid separates unit tests, evals and optimization, and simulations | Scenario Agent Testing Pyramid |

My interpretation: Agno has moved evaluation closer to the agent runtime itself. LangWatch Scenario has moved testing closer to real user behavior. That is a useful split.

What Agno Evals Are Optimized For

Agno Evals live inside the same world as the Agno Agent or Team. The examples use Agno classes such as AccuracyEval, AgentAsJudgeEval, PerformanceEval, and ReliabilityEval, and the evaluated target can be an Agent, a Team, or a captured run output.

That makes them a natural fit for framework-local quality checks:

| Evaluation Need | Agno Fit |
| --- | --- |
| “Did the agent answer the math question correctly?” | Accuracy eval |
| “Did the answer meet a writing rubric?” | Agent-as-judge eval |
| “Did this version get slower or use more memory?” | Performance eval |
| “Did the agent call factorial when solving 10!?” | Reliability eval |
| “Did the Team delegate to the right member/tool path?” | Reliability eval on Team run output |
| “Can I store eval results alongside Agno runtime data?” | Agno evals with database and AgentOS |

The important point is that Agno Evals are close to the agent implementation. You can score the exact object you are building.

A Minimal Agno Eval Pattern

The current Agno docs show accuracy evals with a model, an agent, an input, an expected output, and optional guidelines. In practice, that pattern looks like this:

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIResponses
from agno.tools.calculator import CalculatorTools

agent = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    tools=[CalculatorTools()],
)

evaluation = AccuracyEval(
    name="Calculator Evaluation",
    model=OpenAIResponses(id="gpt-5.2"),
    agent=agent,
    input="What is 10*5 then to the power of 2?",
    expected_output="2500",
    additional_guidelines="The answer should include the final result.",
    num_iterations=3,
)

result = evaluation.run(print_results=True)
assert result is not None and result.avg_score >= 8

That is a good shape when you already know the question and the expected answer.
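
Reliability evals follow a similar framework-local shape. The sketch below is my reading of the Agno Reliability Evals docs: capture a run, then check which tools it actually called. Treat the exact field names as assumptions if your Agno version differs.

from agno.agent import Agent
from agno.eval.reliability import ReliabilityEval
from agno.models.openai import OpenAIResponses
from agno.tools.calculator import CalculatorTools

agent = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    tools=[CalculatorTools()],
)

# Capture a real run first; the eval inspects the tool calls recorded on it.
response = agent.run("What is 10!?")

evaluation = ReliabilityEval(
    name="Factorial Tool Call",
    agent_response=response,
    expected_tool_calls=["factorial"],
)

result = evaluation.run(print_results=True)
assert result is not None
result.assert_passed()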

Where Agno Evals Shine

Agno Evals are especially good for:

  • scoring single-turn or bounded tasks
  • comparing model choices inside an Agno Agent
  • checking structured output quality
  • validating that expected tools were called
  • measuring latency and memory after prompt or tool changes
  • storing eval runs in the same database-backed production system
  • exposing evaluation history through AgentOS

If I am building an Agno support router, I want native evals for things like:

  • intent classification accuracy
  • escalation decision correctness
  • expected tool-call checks
  • response rubric scoring
  • Team routing behavior
  • latency budget regressions

These are specific probes. They tell me whether a known component behaves well under known pressure.
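
To make the first two concrete, a classification probe can reuse the AccuracyEval shape from the calculator example. The support_router agent and the label set here are hypothetical stand-ins for your own:

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIResponses

# Hypothetical router agent that replies with a single intent label.
support_router = Agent(
    model=OpenAIResponses(id="gpt-5.2"),
    instructions="Classify the message as one of: billing_refund, damaged_item, other. Reply with the label only.",
)

intent_eval = AccuracyEval(
    name="Intent classification: duplicate charge",
    model=OpenAIResponses(id="gpt-5.2"),
    agent=support_router,
    input="I was charged twice for order A123",
    expected_output="billing_refund",
    additional_guidelines="The response should contain only the intent label.",
    num_iterations=5,
)

result = intent_eval.run(print_results=True)
assert result is not None and result.avg_score >= 8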

What LangWatch Scenario Is Optimized For

LangWatch Scenario sits one level higher. It does not just ask “was this answer correct?”; it asks whether the full agent can handle a situation.

The docs describe a test structure with:

  • an AgentAdapter that connects your real agent
  • a UserSimulatorAgent that role-plays the user
  • a JudgeAgent that evaluates the conversation against criteria
  • an optional script that controls the turn sequence
  • set_id and batch identifiers for grouping results
  • caching and debug modes to make non-deterministic tests easier to repeat
  • LangWatch visualization for simulation runs

That shape is much closer to end-to-end testing than to a scorecard.

A Minimal Scenario Pattern

Scenario’s getting-started docs show a pattern like this:

import pytest
import scenario

scenario.configure(default_model="openai/gpt-4.1-mini")

@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_billing_dispute_flow():
    # support_agent is your real Agno agent, defined elsewhere in the test module.
    class SupportAgent(scenario.AgentAdapter):
        async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes:
            # Pass Scenario's thread_id through as the Agno session_id so the
            # agent stays session-aware across simulated turns.
            return support_agent.run(
                input.last_new_user_message_str(),
                session_id=input.thread_id,
            ).content

    result = await scenario.run(
        name="billing dispute needs account lookup",
        description="""
            User believes they were charged twice.
            They are frustrated but cooperative.
            They want a refund or a clear explanation.
        """,
        agents=[
            SupportAgent(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent asks for enough account information before taking action",
                "Agent explains billing policy clearly",
                "Agent does not promise a refund before verification",
                "Agent escalates if the case cannot be resolved safely",
            ]),
        ],
        max_turns=8,
    )

    assert result.success

This test can fail even when the final answer looks good. For example, it can catch that the agent promised a refund too early, skipped verification, or kept asking redundant questions.
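
Scenario’s optional script gives you finer control over the turn sequence when you want to pin a specific pressure point. The sketch below assumes the scripting helpers described in the Scenario docs (scenario.user, scenario.agent, scenario.judge) and assumes the SupportAgent adapter above is lifted to module scope:

@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_refund_pressure_is_resisted():
    result = await scenario.run(
        name="refund pressure before verification",
        description="User demands an immediate refund for a duplicate charge.",
        agents=[
            SupportAgent(),  # same adapter as in the test above
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(criteria=[
                "Agent verifies the account and order before promising any refund",
            ]),
        ],
        script=[
            scenario.user("Refund me right now, you charged me twice."),
            scenario.agent(),   # agent under test replies
            scenario.user(),    # simulator improvises the follow-up turn
            scenario.agent(),
            scenario.judge(),   # force a verdict at this point in the conversation
        ],
    )

    assert result.success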

Where Scenario Shines

Scenario is strongest for:

  • multi-turn agent behavior
  • ambiguous user requests
  • clarification and confirmation flows
  • realistic edge cases
  • recovery after user correction
  • degraded operation and missing configuration
  • end-to-end business capability checks
  • safety and adversarial behavior tests
  • CI tests that express user journeys instead of expected strings

If Agno Evals are closer to “measure this response,” Scenario is closer to “act out this use case and see whether the agent can finish the job.”

Direct Comparison

| Dimension | Agno Evals | LangWatch Scenario |
| --- | --- | --- |
| Primary purpose | Measure Agno Agent or Team quality | Test full agent behavior through simulations |
| Best unit of testing | Input/output, run output, tool call, Agent, Team | User journey, conversation, scenario, business task |
| Framework coupling | Agno-native | Framework-agnostic, with Agno adapter support |
| Main abstraction | Eval classes such as AccuracyEval and ReliabilityEval | scenario.run() with AgentAdapter, simulator, judge, script |
| Multi-turn behavior | Possible through the agent runtime, but not the central abstraction | Core design goal |
| Expected output checks | Strong | Possible, but usually expressed as criteria or custom assertions |
| Tool-call checks | Strong through reliability evals | Strong through scripts, custom assertions, and judge criteria |
| Performance measurement | Native dimension | Not the primary focus |
| Production runtime fit | Natural with AgentOS and database-backed eval runs | Natural in CI and simulation visualization |
| Stakeholder readability | Good for metrics and score history | Very good for “can the agent do this real thing?” |
| Failure output | Scores, pass/fail, summaries, stored eval runs | Full conversation, judge reasoning, pass/fail, simulation visualization |

The short version:

Agno Evals are better for measuring known properties.
Scenario is better for discovering behavioral failures.

The Testing Pyramid Lens

LangWatch’s Scenario docs describe an agent testing pyramid with three layers:

  1. Unit tests for deterministic software pieces.
  2. Evals and optimization for probabilistic components.
  3. Simulations for complete agent behavior.

Agno Evals fit mostly in the middle layer. They measure probabilistic components and agent outputs. Agno reliability evals also touch the boundary between unit/integration checks and behavior checks because tool-call expectations are concrete.

Scenario fits at the top layer. It validates whether the integrated system works from the user’s point of view.

That gives us a practical architecture:

                 AGENT QUALITY STACK

  Unit tests
  - tool functions
  - database queries
  - policy parsers
  - schema validation

  Agno Evals
  - answer accuracy
  - rubric scoring
  - expected tool calls
  - Agent and Team performance

  LangWatch Scenario
  - multi-turn user journeys
  - ambiguous requests
  - edge cases
  - release-blocking behavior tests

  Production monitoring
  - traces
  - online evals
  - alerts
  - failed traces promoted into new tests

No single layer does the whole job. That is not a weakness. It is what mature test strategy looks like.

A Real Example: Refund Support Agent

Imagine an Agno agent that handles refund requests. It has tools for:

  • looking up an order
  • checking refund eligibility
  • opening a ticket
  • escalating to a human
  • sending a customer-facing reply

The product requirement is:

The agent should help eligible customers get refunds,
but it must not promise money back before checking policy,
and it must escalate unclear or high-risk cases.

What I Would Test With Agno Evals

Use Agno Accuracy Evals for known cases:

| Input | Expected |
| --- | --- |
| “I was charged twice for order A123” | classify as billing/refund |
| “My package arrived damaged” | classify as damaged item |
| “Can I get a refund after 90 days?” | explain policy or escalate |

Use Agent-as-Judge Evals for response quality:

  • clear explanation
  • no unsupported promises
  • polite tone
  • policy-grounded language
  • no invented refund terms

Use Reliability Evals for expected tools:

| Scenario | Expected Tool Calls |
| --- | --- |
| user gives order number | get_order_status, check_refund_policy |
| unclear refund request | clarification before tool action |
| high-risk billing dispute | escalate_to_human |

Use Performance Evals for release safety:

  • p95 response time
  • memory impact
  • model or prompt comparison
  • Team routing overhead

These are targeted measurements. They are compact, repeatable, and close to the Agno implementation.
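
For the performance items, a minimal guardrail might look like the sketch below. It assumes the PerformanceEval interface from the Agno performance docs (a func to measure plus num_iterations); the refund_agent is a hypothetical stand-in for your real Agent or Team.

from agno.agent import Agent
from agno.eval.performance import PerformanceEval
from agno.models.openai import OpenAIResponses

# Hypothetical refund agent; swap in your real Agent or Team.
refund_agent = Agent(model=OpenAIResponses(id="gpt-5.2"))

def run_refund_request():
    # One representative agent run; the eval times and measures this call.
    return refund_agent.run("I was charged twice for order A123").content

performance_eval = PerformanceEval(
    name="Refund agent latency",
    func=run_refund_request,
    num_iterations=5,
)

# Prints runtime and memory statistics across the iterations.
performance_eval.run(print_results=True)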

What I Would Test With Scenario

Use Scenario for full journeys:

| Scenario | What It Proves |
| --- | --- |
| Frustrated customer says “you charged me twice” but does not provide order details | Agent asks for needed account/order info without making promises |
| Customer changes from refund request to replacement request mid-conversation | Agent adapts instead of forcing the original path |
| Tool says the order is outside refund window | Agent explains policy and offers escalation without hallucinating exceptions |
| User pushes for a refund before verification | Agent maintains policy boundary |
| User provides partial information across multiple turns | Agent tracks context and avoids asking the same thing repeatedly |

These are behavior rehearsals. They tell you whether the agent can actually carry the conversation.

How To Combine Them In CI

A practical workflow looks like this:

Pull request opens
    |
    v
Run unit tests
    |
    v
Run fast Agno Evals
  - accuracy smoke set
  - key reliability checks
  - one performance guardrail
    |
    v
Run Scenario smoke simulations
  - top happy path
  - top ambiguity path
  - top safety boundary path
    |
    v
Block merge only on high-signal failures

For nightly or pre-release runs:

Nightly job
    |
    v
Run full Agno eval suite
    |
    v
Run broader Scenario pack
    |
    v
Export failed traces and failed conversations
    |
    v
Promote new cases into eval datasets or scenario tests

The trick is to keep pull request checks fast. Run enough to catch obvious regressions, then let heavier simulation packs run on a schedule.
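
One lightweight way to enforce that split, assuming you keep the pytest markers from the Scenario example and add a smoke marker of your own, is to register them in conftest.py and filter per pipeline stage:

# conftest.py
def pytest_configure(config):
    # Register the markers so pytest does not warn about unknown marks.
    config.addinivalue_line("markers", "agent_test: agent simulation tests")
    config.addinivalue_line("markers", "smoke: fast, release-blocking checks run on every pull request")

A pull request job can then run pytest -m "agent_test and smoke", while the nightly job drops the smoke filter and runs the full pack.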

The Decision Guide

Use Agno Evals when:

  • you are already building in Agno
  • the behavior can be expressed as input plus expected output
  • you need model, prompt, or tool comparison inside the Agno runtime
  • you need tool-call reliability checks
  • you need performance and memory measurement
  • you want eval runs stored with Agno database infrastructure or surfaced through AgentOS

Use LangWatch Scenario when:

  • the failure happens across multiple turns
  • the user journey matters more than one answer
  • the agent must ask clarifying questions
  • the agent must recover from confusion, tool failure, or user correction
  • product stakeholders need to see “can it do the job?”
  • you want CI tests that read like business scenarios
  • you need simulation results in LangWatch for review and debugging

Use both when:

  • the agent touches money, support, operations, legal, healthcare, security, deployment, or customer trust
  • prompt changes are frequent
  • model upgrades are frequent
  • you need both scores and behavioral evidence
  • you want failures to become durable regression tests

Common Mistakes

Mistake 1: Treating Evals As End-to-End Tests

An eval can say an answer is accurate. It does not automatically prove the agent can handle a messy conversation.

For example, an Agno eval might confirm that the agent answers “refunds are available within 30 days” correctly. Scenario might reveal that the agent fails when the user says, “I bought it a while ago, I think maybe last month, but I changed cards since then.”

That is not the same test.

Mistake 2: Treating Scenario As A Metrics System

Scenario can produce pass/fail results and judge reasoning, but it is not primarily a latency benchmark or component optimization harness.

If you want to compare model A vs model B across 200 classification cases, native evals are usually the cleaner place to start.

Mistake 3: Only Testing Happy Paths

The first Scenario pack should include at least:

  • one normal completion path
  • one ambiguity path
  • one safety boundary path
  • one tool failure or missing data path
  • one user correction path

The first Agno eval suite should include at least:

  • known-good expected outputs
  • known-bad edge cases
  • expected tool-call checks
  • a rubric for free-form responses
  • a simple performance budget

Mistake 4: Letting Failed Cases Disappear

When an Agno eval fails, turn the input into a durable case.

When a Scenario test fails, preserve the conversation pattern. Was the user ambiguous? Did the agent skip verification? Did it lose context after a tool call? That pattern is the real asset.
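
One low-ceremony way to keep those cases, assuming the AccuracyEval interface shown earlier, is to append each failure to a small JSONL dataset and replay it as a regression suite. The file path, record format, and refund_agent here are illustrative:

import json
from pathlib import Path

from agno.agent import Agent
from agno.eval.accuracy import AccuracyEval
from agno.models.openai import OpenAIResponses

GOLDEN_CASES = Path("evals/refund_golden_cases.jsonl")  # illustrative path
refund_agent = Agent(model=OpenAIResponses(id="gpt-5.2"))  # stand-in for your real agent

def load_cases():
    # One JSON object per line: {"input": ..., "expected_output": ..., "note": ...}
    return [json.loads(line) for line in GOLDEN_CASES.read_text().splitlines() if line.strip()]

def test_promoted_failures():
    for case in load_cases():
        evaluation = AccuracyEval(
            name=f"Promoted case: {case['note']}",
            model=OpenAIResponses(id="gpt-5.2"),
            agent=refund_agent,
            input=case["input"],
            expected_output=case["expected_output"],
        )
        result = evaluation.run(print_results=False)
        assert result is not None and result.avg_score >= 8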

The Practical Recommendation

For an Agno production agent, I would start like this:

  1. Write unit tests for deterministic tools and data transforms.
  2. Add Agno Accuracy Evals for 10 to 20 golden cases.
  3. Add Agno Reliability Evals for the top tool-call expectations.
  4. Add one Agent-as-Judge Eval for response quality.
  5. Add one small Performance Eval so latency regressions are visible.
  6. Add 3 to 5 LangWatch Scenario tests for the highest-value user journeys.
  7. Run the small set in CI and the larger set nightly.
  8. Promote production failures into Agno eval cases or Scenario simulations depending on their shape.

The placement rule is simple:

If the failure is about a measurable response property, put it in Agno Evals.
If the failure is about a conversation path, put it in Scenario.

Final Take

Agno Evals and LangWatch Scenario are complementary tools in the same quality system.

Agno Evals give you framework-native measurement: accuracy, rubric quality, performance, reliability, Agent behavior, Team behavior, and AgentOS visibility. They are the probes you attach to the machine.

LangWatch Scenario gives you simulation-based behavioral testing: realistic users, judges, scripts, multi-turn flows, and agent-framework adapters. It is the rehearsal before the machine meets the public.

If you are shipping agents that matter, you want both:

Agno Evals for measurable quality.
LangWatch Scenario for lived behavior.
Production traces to keep both honest.

That is the difference between asking, “Did the model answer this example?” and asking, “Can the agent actually do the job?”