AI Quality & Evaluation

Trace Grading vs Scenario Testing: How to Evaluate Agents in Production

Why production agent evaluation is moving beyond output-only checks, how trace-aware grading complements scenario testing, and how LangWatch, LangSmith, and Langfuse compare.

12 min read Updated Mar 30, 2026

As of March 30, 2026, my read of the official docs for LangWatch, LangSmith, and Langfuse is that agent evaluation is expanding from final-answer scoring toward trace-aware grading: scoring traces, spans, tool calls, and trajectories, not just the last message. That sentence is my editorial inference from current product capabilities, not a standardized industry benchmark.

The reason is simple: agents are not just answer generators. They are decision systems. A final answer can look fine while the execution behind it is wasteful, brittle, unsafe, or one prompt edit away from breaking.

TL;DR

  • Output-only checks score the final answer. They are still useful, but they miss many agent failures.
  • Trace-aware grading scores the execution path: tool choice, tool arguments, retrieval quality, intermediate decisions, latency, cost, and handoff behavior.
  • Scenario testing is still essential because it rehearses multi-turn behavior before release.
  • The winning production pattern is usually both, not either-or:
    • Scenario tests for pre-release behavior
    • Trace grading for live production quality
  • LangWatch leans hardest into simulation-first agent testing.
  • LangSmith is very strong across offline evals, online evals, trajectory evaluation, and human review workflows.
  • Langfuse is especially strong if you want an open-source, self-hostable observability and scoring stack centered on traces and step-wise evaluations.

What You Will Learn Here

  • Why output-only evaluation breaks down for agents in production
  • What trace-aware grading actually means in practice
  • How scenario testing and trace grading fit together
  • A practical evaluation stack for teams with engineers and PMs in the loop
  • How LangWatch, LangSmith, and Langfuse compare as of March 30, 2026
  • Where to start if your team is still evaluating agents mostly by vibes

Why It Matters Now

Three things are happening at once:

  • More teams are shipping agents that make intermediate decisions, not just final responses.
  • More quality regressions now show up as bad paths through a workflow, not obviously bad final text.
  • More evaluation platforms now expose ways to score traces, runs, spans, observations, and trajectories directly.

That combination changes the evaluation job. The question is no longer only “Was the answer good?” It is also “Was the execution trustworthy?”

Why Output-Only Evaluation Breaks Down

If you only grade the last answer, you miss the process that produced it.

Imagine a support agent handling a refund request:

  1. It retrieves the wrong policy document.
  2. It calls the refund tool before account verification.
  3. It retries the same tool twice because the prompt loop is unstable.
  4. It still ends with a polite answer that sounds plausible.

An output-only judge may score that final answer as “mostly correct.” Production, however, cares about more than style:

  • Did the agent choose the right tool?
  • Did it pass the right arguments?
  • Did it use the right context?
  • Did it take too many steps?
  • Did it expose a risky path even if the final answer looked fine?

That is the core reason trace-aware grading matters.
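To make the gap concrete, here is a minimal sketch where the same trace passes an output-only check but fails trace-aware checks. The data shapes and checks are hypothetical, not any platform's API:

```python
# Hypothetical trace shape: a final answer plus an ordered list of steps.

def output_only_check(trace: dict) -> bool:
    # Only looks at the final answer text.
    return "refund" in trace["final_answer"].lower()

def trace_aware_check(trace: dict) -> list[str]:
    # Inspects the execution path, not just the output.
    issues = []
    steps = [s["name"] for s in trace["steps"]]
    if steps.index("refund_tool") < steps.index("verify_account"):
        issues.append("refund issued before account verification")
    if steps.count("refund_tool") > 1:
        issues.append("redundant refund tool retries")
    return issues

trace = {
    "final_answer": "Your refund has been processed. Anything else?",
    "steps": [
        {"name": "policy_lookup"},
        {"name": "refund_tool"},    # called before verification
        {"name": "refund_tool"},    # retried
        {"name": "verify_account"},
    ],
}

print(output_only_check(trace))   # True: the answer alone looks fine
print(trace_aware_check(trace))   # surfaces two process failures
```

Both checks see the same trace; only one of them can tell you the refund tool fired before verification.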

Trace Grading vs Scenario Testing

These are related ideas, but they solve different problems.

Scenario Testing

Scenario testing asks:

  • Can this agent handle a realistic multi-turn conversation?
  • Does it ask clarifying questions at the right time?
  • Does it recover from ambiguity, refusal, or tool failure?
  • Does the whole behavior still make sense after a model or prompt change?

This is rehearsal. It is closest to end-to-end behavioral testing.

Simulated user
    |
    v
Agent under test
    |
    v
Judge / rubric
    |
    +--> pass / fail / continue

Trace Grading

Trace grading asks:

  • For this real production trace, how good was each step?
  • Was the retrieval span relevant?
  • Was the tool-routing decision appropriate?
  • Did the final answer align with the path taken?
  • Are quality, cost, and latency drifting over time?

This is operational scoring. It is closest to production observability plus evaluation.

Live traffic
    |
    v
Trace
  +-- span: retrieval
  +-- span: tool choice
  +-- span: tool execution
  +-- span: final answer
    |
    v
Per-step graders + user feedback + dashboards
    |
    v
Scores, alerts, annotations, datasets

The Short Version

  • Scenario testing tells you whether the agent can behave well in a rehearsed situation.
  • Trace grading tells you whether the agent is behaving well in the wild.

For production teams, that is the difference between pre-release confidence and post-release control.

What “Trace-Aware” Really Means

Trace-aware grading means your evaluation target is not just the final string. It is the execution structure behind it.

For agents, that often means grading some combination of:

| Layer | Example Questions |
| --- | --- |
| Final answer | Was the answer correct, helpful, safe, and on-brand? |
| Tool choice | Did the agent choose the right tool or workflow? |
| Tool arguments | Were the parameters complete and correct? |
| Retrieval | Did it fetch the right documents or records? |
| Trajectory | Did the sequence of actions make sense? |
| Cost / latency | Was the path efficient enough for production? |
| Human feedback | Did users and reviewers agree with the automated scores? |
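One lightweight way to operationalize these layers is a rubric registry that maps each layer to a grader function. Everything here is illustrative: the names, data shapes, and grader bodies are stand-ins, not a platform API.

```python
# Stand-in graders; in practice these would call rules or LLM judges.

def grade_final_answer(span: dict) -> float:
    return 1.0 if span.get("answer") else 0.0

def grade_tool_choice(span: dict) -> float:
    return 1.0 if span.get("tool") in span.get("allowed_tools", []) else 0.0

# Registry mapping evaluation layers to graders.
RUBRICS = {
    "final_answer": grade_final_answer,
    "tool_choice": grade_tool_choice,
}

def grade_trace(spans: list[dict]) -> dict[str, float]:
    # Apply each layer's grader to spans of that type.
    scores = {}
    for span in spans:
        grader = RUBRICS.get(span["type"])
        if grader:
            scores[span["type"]] = grader(span)
    return scores

spans = [
    {"type": "tool_choice", "tool": "refund", "allowed_tools": ["refund"]},
    {"type": "final_answer", "answer": "Refund issued."},
]
print(grade_trace(spans))
```

The design point is that each layer gets its own grader and its own score, so a regression in tool choice cannot hide behind a good final answer.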

This is especially important for agents because many failures are process failures before they become visible answer failures.

The Failure Modes Output-Only Checks Miss

Here are a few common misses:

| Failure Mode | Why Output-Only Checks Miss It | Why Trace Grading Catches It |
| --- | --- | --- |
| Wrong tool, lucky answer | Final text still sounds right | Trace shows incorrect tool call |
| Excessive retries | Final text may still be fine | Trace shows inefficient loop |
| Weak retrieval | Judge may reward polished paraphrase | Retrieval span can be scored directly |
| Unsafe near-miss | Final answer hides bad intermediate action | Span-level grading exposes the risky step |
| Regressions after prompt edits | Final output varies subtly | Trace metrics show routing drift and extra steps |

This is why I would not frame trace grading as a replacement for scenario testing. I would frame it as the missing production layer that output-only checks cannot cover on their own.

A Practical Evaluation Stack for Production Agents

The most useful stack I have seen is:

                 PRODUCTION AGENT QUALITY LOOP

  Unit tests
      |
      v
  Dataset evals on components
      |
      v
  Scenario tests on full agent behavior
      |
      v
  Release
      |
      v
  Production traces
      |
      +--> trace grading
      +--> alerts
      +--> human annotation
      |
      v
  Promote bad traces into datasets and scenarios
      |
      v
  Re-run evals before next release

In plain English:

  • Unit tests protect deterministic code.
  • Dataset evals score components and prompts.
  • Scenario tests rehearse agent behavior before release.
  • Trace grading monitors real behavior after release.
  • Human review calibrates the automated graders.
  • Bad production traces become tomorrow’s offline tests.

That loop is where evaluation starts becoming an engineering system instead of a demo ritual.
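The "promote bad traces into datasets" step of that loop can be sketched like this. The data shapes are hypothetical; a real pipeline would pull scored traces from your platform's API.

```python
# Low-scoring production traces become offline regression cases
# for the next release.

SCORE_THRESHOLD = 0.6

def promote_bad_traces(traces: list[dict], dataset: list[dict]) -> int:
    promoted = 0
    for t in traces:
        # Any weak step on the trace qualifies it for the dataset.
        if min(t["scores"].values()) < SCORE_THRESHOLD:
            dataset.append({
                "input": t["input"],
                "expected_behavior": "needs human-authored reference",
                "source_trace_id": t["id"],
            })
            promoted += 1
    return promoted

dataset: list[dict] = []
traces = [
    {"id": "t1", "input": "refund order 42",
     "scores": {"retrieval": 0.9, "tool_choice": 0.3}},
    {"id": "t2", "input": "track my order",
     "scores": {"retrieval": 0.8, "tool_choice": 0.9}},
]
print(promote_bad_traces(traces, dataset))  # prints 1
```

Note the "needs human-authored reference" placeholder: promotion only creates the candidate; a human still writes the expected behavior before the case joins the regression suite.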

Code Example: Grade the Trace, Not Just the Answer

Here is the simplest mental model in Python-like pseudocode:

def handle_refund_request(user_input: str):
    # Pseudocode: start_trace, start_span, run_action, and the judge_* /
    # validate_* helpers stand in for your platform SDK and app code.
    # Real code would also end each span and handle tool errors.
    trace = start_trace(name="refund-agent")

    # Score retrieval quality on the retrieval span itself.
    retrieval = start_span(trace, name="policy_lookup")
    policy_docs = search_policy_docs(user_input)
    retrieval.score(
        name="retrieval_relevance",
        value=judge_retrieval(user_input, policy_docs),
    )

    # Score the routing decision before the tool even runs.
    planning = start_span(trace, name="tool_routing")
    action = choose_next_action(user_input, policy_docs)
    planning.score(
        name="tool_choice_correctness",
        value=judge_tool_choice(user_input, action),
    )

    # Validate the arguments that were actually passed to the tool.
    execution = start_span(trace, name="tool_execution")
    result = run_action(action)
    execution.score(
        name="tool_argument_quality",
        value=validate_tool_args(action),
    )

    # Only now grade the final answer, as one signal among several.
    response = start_span(trace, name="final_response")
    answer = draft_user_response(user_input, result)
    response.score(
        name="answer_helpfulness",
        value=judge_answer(user_input, answer),
    )

    # Trace-level score: was the whole path efficient?
    trace.score(name="workflow_efficiency", value=judge_efficiency(trace))
    return answer

The important design choice is this:

  • Do not wait until the final answer to score quality.
  • Score the steps that matter to your business risk.
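For illustration, here is one possible heuristic behind a workflow_efficiency score like the one in the pseudocode. It is a deliberately simple sketch, not a recommended production metric, and it takes a list of step names rather than a trace object:

```python
# Penalize long paths and repeated steps; 1.0 means short and non-repetitive.

def judge_efficiency(step_names: list[str], max_steps: int = 6) -> float:
    if not step_names:
        return 0.0
    unique_ratio = len(set(step_names)) / len(step_names)   # penalize retries
    length_penalty = min(1.0, max_steps / len(step_names))  # penalize long paths
    return round(unique_ratio * length_penalty, 2)

print(judge_efficiency(["lookup", "route", "refund_tool", "respond"]))        # 1.0
print(judge_efficiency(["lookup", "refund_tool", "refund_tool", "respond"]))  # 0.75
```

Even a crude heuristic like this turns "the agent feels slow" into a number you can trend, alert on, and compare across prompt versions.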

In practice, the three platforms map to this pattern in different ways:

  • LangWatch supports attaching evaluations to traces and spans, plus running experiments and agent simulations.
  • LangSmith supports offline evaluators, online evaluators on traces, and agent trajectory evaluations.
  • Langfuse supports scores on traces and observations, managed LLM-as-a-judge on production or development traces, and experiments on datasets.

Scenario Testing Still Matters More Than Many Teams Think

A trap I see often is this:

  • Team adds traces
  • Team adds a judge on the final answer
  • Team declares evaluation solved

But production agents often fail in conversation, not just in isolated responses.

Examples:

  • The agent should ask a clarifying question before booking a flight.
  • The agent should refuse to call a destructive tool without confirmation.
  • The agent should recover when a tool returns partial data.
  • The agent should preserve context across multiple turns without repeating work.

Those are scenario problems. They are hard to reduce to single input-output pairs.
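A scenario test for the first of those bullets can be sketched as a simulated-user loop. The toy agent, the script, and the rubric here are all hypothetical; real scenario frameworks drive actual agents with LLM-simulated users and judges.

```python
# A scripted "user" drives a multi-turn conversation; the rubric checks
# behavior that single input/output pairs cannot capture.

def simulated_user(turn: int) -> str:
    script = ["Book me a flight.", "To Lisbon, next Friday."]
    return script[turn]

def agent(message: str, state: dict) -> str:
    # Toy agent: asks a clarifying question before booking.
    if "destination" not in state:
        if "Lisbon" in message:
            state["destination"] = "Lisbon"
            return "Booked your flight to Lisbon for next Friday."
        return "Where would you like to fly, and when?"
    return "Anything else?"

def run_scenario() -> dict:
    state: dict = {}
    transcript = []
    for turn in range(2):
        user_msg = simulated_user(turn)
        transcript.append((user_msg, agent(user_msg, state)))
    # Rubric: the agent must clarify before it books.
    clarified_first = "?" in transcript[0][1]
    booked_after = "Booked" in transcript[1][1]
    return {"pass": clarified_first and booked_after, "transcript": transcript}

print(run_scenario()["pass"])  # True
```

The rubric lives on the whole transcript, not on any single reply, which is exactly what an input-output dataset cannot express.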

This is where LangWatch’s Scenario tooling is especially opinionated: it treats agent reliability as something you should rehearse through realistic interactions, not just score after the fact. LangSmith also supports strong agent evaluation through trajectory evaluators and single-step/final-response evaluators, but its framing in the docs is more evaluation-lifecycle-centric than simulation-first. Langfuse, based on the docs I reviewed, is strongest on trace-centric scoring and experiments rather than on a first-party scenario simulation workflow.

That last comparison sentence is also my editorial synthesis from the docs, not a vendor claim.

LangWatch vs LangSmith vs Langfuse

The best choice depends on which evaluation problem hurts you most.

| Tool | Where It Feels Strongest | Trace-Aware Grading | Scenario / Trajectory Testing | Human Review | Best Fit |
| --- | --- | --- | --- | --- | --- |
| LangWatch | One stack for observability, evals, and simulation-heavy agent testing | Strong trace/span evaluation model, online evals, alerts | Strongest scenario-first story of the three | Annotations supported | Teams that want simulation plus production grading in one workflow |
| LangSmith | Mature evaluation lifecycle from offline datasets to online evaluators | Strong online evaluators on traces and rich experiment workflows | Strong trajectory evaluation; good for agent path checks | Strong annotation queue workflow | Teams already using LangChain/LangGraph or wanting robust eval operations |
| Langfuse | Open-source observability, prompt management, and scoring around real traces | Strong scores on traces/observations and managed LLM-as-judge on production traces | Good offline experiments, but less scenario-centric in docs reviewed | Annotation queues supported | Teams that want open-source, self-hostable trace-centric evaluation |

My Practical Read

  • Pick LangWatch if the missing piece is “we need better ways to rehearse and debug full agent behavior.”
  • Pick LangSmith if the missing piece is “we need a disciplined evaluation system across offline, online, trajectories, and human review.”
  • Pick Langfuse if the missing piece is “we want deep production tracing plus flexible scoring in an open-source stack.”

If I were advising a team from scratch, I would not ask only “Which tool has evals?” I would ask:

  • Do we need simulations most?
  • Do we need trajectory and evaluator operations most?
  • Do we need trace-native, open-source production observability most?

That question usually makes the choice much clearer.

Where PMs and Engineers Should Align

This topic is not just for platform engineers.

PMs care about:

  • Whether the agent completes the workflow users expect
  • Whether regressions show up before support tickets pile up
  • Whether quality can be explained in dashboards, not anecdotes

Engineers care about:

  • Whether the tool-routing logic regressed
  • Whether prompt changes increased retries or cost
  • Whether low scores can be traced back to a specific step

Trace-aware grading gives both groups a better shared object:

  • PMs get quality trends tied to user journeys
  • Engineers get the exact spans, steps, and rubrics behind those scores

That is a much healthier operating model than arguing about one cherry-picked screenshot from staging.

Where to Start on Monday

If your team is early, do this:

  1. Pick one production workflow that matters, like refund handling or lead qualification.
  2. Define three or four span-level rubrics:
    • retrieval relevance
    • tool choice correctness
    • final answer helpfulness
    • workflow efficiency
  3. Add human review for low-score traces.
  4. Turn the worst real traces into offline datasets and scenario tests.
  5. Only then expand to more metrics.
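As a starting point for steps 2 and 3, the rubric can live as plain data with a simple review-routing rule. The names, scales, and threshold below are illustrative:

```python
# A starter rubric for one workflow, small enough to actually calibrate.
REFUND_RUBRIC = {
    "retrieval_relevance":     {"scale": (0.0, 1.0), "grader": "llm_judge"},
    "tool_choice_correctness": {"scale": (0.0, 1.0), "grader": "rule_based"},
    "answer_helpfulness":      {"scale": (0.0, 1.0), "grader": "llm_judge"},
    "workflow_efficiency":     {"scale": (0.0, 1.0), "grader": "heuristic"},
}

def needs_human_review(scores: dict[str, float], floor: float = 0.5) -> bool:
    # Route any trace with a low score on any rubric item to review.
    return any(scores.get(name, 0.0) < floor for name in REFUND_RUBRIC)

print(needs_human_review({
    "retrieval_relevance": 0.9,
    "tool_choice_correctness": 0.4,   # weak step -> goes to review queue
    "answer_helpfulness": 0.8,
    "workflow_efficiency": 0.7,
}))  # True
```

Keeping the rubric this small is the point: four scores a human can sanity-check beats forty that nobody calibrates.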

If you try to evaluate everything at once, you will build dashboards nobody trusts.

If you start with one workflow and a small rubric, you can actually improve the system.

Final Take

Output-only evaluation made sense when many LLM apps were basically smart text functions.

Agents changed the game. Once a system can plan, call tools, branch, retry, and maintain state, the final answer stops being the whole story. The execution path becomes part of the product.

That is why I expect more production teams to treat trace grading as a standard layer of agent quality, while keeping scenario testing as the rehearsal layer that protects releases.

The real choice is not trace grading or scenario testing.

The real production move is:

  • use scenario testing to practice behavior before release
  • use trace-aware grading to control behavior after release

Sources