As of March 30, 2026, my read of the official docs for LangWatch, LangSmith, and Langfuse is that agent evaluation is expanding from final-answer scoring toward trace-aware grading: scoring traces, spans, tool calls, and trajectories, not just the last message. That sentence is my editorial inference from current product capabilities, not a standardized industry benchmark.
The reason is simple: agents are not just answer generators. They are decision systems. A final answer can look fine while the execution behind it is wasteful, brittle, unsafe, or one prompt edit away from breaking.
TL;DR
- Output-only checks score the final answer. They are still useful, but they miss many agent failures.
- Trace-aware grading scores the execution path: tool choice, tool arguments, retrieval quality, intermediate decisions, latency, cost, and handoff behavior.
- Scenario testing is still essential because it rehearses multi-turn behavior before release.
- The winning production pattern is usually both, not either-or:
  - Scenario tests for pre-release behavior
  - Trace grading for live production quality
- LangWatch leans hardest into simulation-first agent testing.
- LangSmith is very strong across offline evals, online evals, trajectory evaluation, and human review workflows.
- Langfuse is especially strong if you want an open-source, self-hostable observability and scoring stack centered on traces and step-wise evaluations.
What You Will Learn Here
- Why output-only evaluation breaks down for agents in production
- What trace-aware grading actually means in practice
- How scenario testing and trace grading fit together
- A practical evaluation stack for teams with Engineers and PMs in the loop
- How LangWatch, LangSmith, and Langfuse compare as of March 30, 2026
- Where to start if your team is still evaluating agents mostly by vibes
Why It Matters Now
Three things are happening at once:
- More teams are shipping agents that make intermediate decisions, not just final responses.
- More quality regressions now show up as bad paths through a workflow, not obviously bad final text.
- More evaluation platforms now expose ways to score traces, runs, spans, observations, and trajectories directly.
That combination changes the evaluation job. The question is no longer only “Was the answer good?” It is also “Was the execution trustworthy?”
Why Output-Only Evaluation Breaks Down
If you only grade the last answer, you miss the process that produced it.
Imagine a support agent handling a refund request:
- It retrieves the wrong policy document.
- It calls the refund tool before account verification.
- It retries the same tool twice because the prompt loop is unstable.
- It still ends with a polite answer that sounds plausible.
An output-only judge may score that final answer as “mostly correct.” Production, however, cares about more than style:
- Did the agent choose the right tool?
- Did it pass the right arguments?
- Did it use the right context?
- Did it take too many steps?
- Did it expose a risky path even if the final answer looked fine?
That is the core reason trace-aware grading matters.
Trace Grading vs Scenario Testing
These are related ideas, but they solve different problems.
Scenario Testing
Scenario testing asks:
- Can this agent handle a realistic multi-turn conversation?
- Does it ask clarifying questions at the right time?
- Does it recover from ambiguity, refusal, or tool failure?
- Does the whole behavior still make sense after a model or prompt change?
This is rehearsal. It is closest to end-to-end behavioral testing.
Simulated user
|
v
Agent under test
|
v
Judge / rubric
|
+--> pass / fail / continue
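The rehearsal loop above can be sketched as a tiny harness. Everything here is illustrative: `toy_agent`, the scripted user turns, and the clarifying-question rubric are stand-ins I made up, not any platform's API.

```python
# Minimal scenario-test sketch: scripted user turns drive the agent,
# then a rubric judges the transcript. All names are hypothetical.

MONTHS = ("january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december")

def toy_agent(history: list[str]) -> str:
    """Stand-in agent: asks for dates before 'booking' a flight."""
    last = history[-1].lower()
    if "book" in last and not any(m in last for m in MONTHS):
        return "Which dates would you like to travel?"
    return "Booked. Anything else?"

def run_scenario(user_turns: list[str]) -> list[tuple[str, str]]:
    """Play scripted user turns against the agent and collect the transcript."""
    history: list[str] = []
    transcript = []
    for turn in user_turns:
        history.append(turn)
        reply = toy_agent(history)
        history.append(reply)
        transcript.append((turn, reply))
    return transcript

def judge_clarification(transcript: list[tuple[str, str]]) -> bool:
    """Rubric: the agent must ask a clarifying question before booking."""
    _, first_reply = transcript[0]
    return first_reply.endswith("?")

transcript = run_scenario(["Please book me a flight to Lisbon",
                           "Book it on March 3"])
assert judge_clarification(transcript)  # the pass/fail gate before release
```

The point of the sketch is the shape, not the toy logic: a scenario test owns the whole conversation, so the rubric can check *when* the agent asked, not just *what* it finally said.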
Trace Grading
Trace grading asks:
- For this real production trace, how good was each step?
- Was the retrieval span relevant?
- Was the tool-routing decision appropriate?
- Did the final answer align with the path taken?
- Are quality, cost, and latency drifting over time?
This is operational scoring. It is closest to production observability plus evaluation.
Live traffic
|
v
Trace
+-- span: retrieval
+-- span: tool choice
+-- span: tool execution
+-- span: final answer
|
v
Per-step graders + user feedback + dashboards
|
v
Scores, alerts, annotations, datasets
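The per-step grading flow above can be modeled with a small data structure. The span names, grader functions, and score values below are assumptions for illustration, not any vendor's schema.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    output: object
    scores: dict = field(default_factory=dict)

@dataclass
class Trace:
    spans: list

# One grader per span name; each returns a 0..1 score. Illustrative only.
def grade_retrieval(output):
    # Reward traces whose retrieval actually found the policy document
    return 1.0 if "refund_policy" in output else 0.0

def grade_tool_choice(output):
    # Reward the tool we expect for this workflow
    return 1.0 if output == "refund_tool" else 0.0

GRADERS = {"retrieval": grade_retrieval, "tool_choice": grade_tool_choice}

def grade_trace(trace: Trace) -> dict:
    """Apply the matching grader to each span and collect per-step scores."""
    results = {}
    for span in trace.spans:
        grader = GRADERS.get(span.name)
        if grader:
            span.scores["quality"] = grader(span.output)
            results[span.name] = span.scores["quality"]
    return results

trace = Trace(spans=[Span("retrieval", ["refund_policy", "faq"]),
                     Span("tool_choice", "refund_tool"),
                     Span("final_answer", "Your refund is on its way.")])
scores = grade_trace(trace)  # {"retrieval": 1.0, "tool_choice": 1.0}
```

Note that the final answer span gets no score here at all: that is the deliberate inversion of output-only evaluation.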
The Short Version
- Scenario testing tells you whether the agent can behave well in a rehearsed situation.
- Trace grading tells you whether the agent is behaving well in the wild.
For production teams, that is the difference between pre-release confidence and post-release control.
What “Trace-Aware” Really Means
Trace-aware grading means your evaluation target is not just the final string. It is the execution structure behind it.
For agents, that often means grading some combination of:
| Layer | Example Questions |
|---|---|
| Final answer | Was the answer correct, helpful, safe, and on-brand? |
| Tool choice | Did the agent choose the right tool or workflow? |
| Tool arguments | Were the parameters complete and correct? |
| Retrieval | Did it fetch the right documents or records? |
| Trajectory | Did the sequence of actions make sense? |
| Cost / latency | Was the path efficient enough for production? |
| Human feedback | Did users and reviewers agree with the automated scores? |
This is especially important for agents because many failures are process failures before they become visible answer failures.
The Failure Modes Output-Only Checks Miss
Here are a few common misses:
| Failure Mode | Why Output-Only Checks Miss It | Why Trace Grading Catches It |
|---|---|---|
| Wrong tool, lucky answer | Final text still sounds right | Trace shows incorrect tool call |
| Excessive retries | Final text may still be fine | Trace shows inefficient loop |
| Weak retrieval | Judge may reward polished paraphrase | Retrieval span can be scored directly |
| Unsafe near-miss | Final answer hides bad intermediate action | Span-level grading exposes the risky step |
| Regressions after prompt edits | Final output varies subtly | Trace metrics show routing drift and extra steps |
This is why I would not frame trace grading as a replacement for scenario testing. I would frame it as the missing production layer that output-only checks cannot cover on their own.
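To make one row of that table concrete, here is how the "excessive retries" failure becomes countable once you have the trace. The trace format (a flat list of tool-call events) is an assumption for illustration.

```python
from collections import Counter

def find_excessive_retries(tool_calls: list[tuple[str, str]], limit: int = 1):
    """Flag identical (tool, args) calls repeated more than `limit` times.
    An output-only judge never sees this loop; the trace makes it countable."""
    counts = Counter(tool_calls)
    return [call for call, n in counts.items() if n > limit]

calls = [("refund_tool", "order=123"),
         ("refund_tool", "order=123"),   # retry with identical arguments
         ("notify_user", "order=123")]
assert find_excessive_retries(calls) == [("refund_tool", "order=123")]
```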
A Practical Evaluation Stack for Production Agents
The most useful stack I have seen is:
PRODUCTION AGENT QUALITY LOOP
Unit tests
|
v
Dataset evals on components
|
v
Scenario tests on full agent behavior
|
v
Release
|
v
Production traces
|
+--> trace grading
+--> alerts
+--> human annotation
|
v
Promote bad traces into datasets and scenarios
|
v
Re-run evals before next release
In plain English:
- Unit tests protect deterministic code.
- Dataset evals score components and prompts.
- Scenario tests rehearse agent behavior before release.
- Trace grading monitors real behavior after release.
- Human review calibrates the automated graders.
- Bad production traces become tomorrow’s offline tests.
That loop is where evaluation starts becoming an engineering system instead of a demo ritual.
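The "bad traces become tomorrow's tests" step in that loop can be sketched in a few lines. The trace shape, score names, and threshold are all hypothetical defaults, not a platform's schema.

```python
def promote_low_scoring_traces(traces: list[dict], threshold: float = 0.5) -> list[dict]:
    """Turn low-scoring production traces into offline dataset items."""
    dataset = []
    for t in traces:
        worst = min(t["scores"].values())
        if worst < threshold:
            dataset.append({"input": t["input"],
                            "expected_fix": "review",  # labeled later by a human
                            "failing_score": worst})
    return dataset

traces = [{"input": "refund order 123",
           "scores": {"retrieval": 0.9, "tool_choice": 0.2}},
          {"input": "refund order 456",
           "scores": {"retrieval": 0.8, "tool_choice": 0.9}}]
assert len(promote_low_scoring_traces(traces)) == 1
```

The design choice worth copying is the trigger: the *minimum* span score decides promotion, because one bad step is enough to make a trace a useful regression test.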
Code Example: Grade the Trace, Not Just the Answer
Here is the simplest mental model in Python-like pseudocode:
def handle_refund_request(user_input: str):
    trace = start_trace(name="refund-agent")

    # Score retrieval quality on the span, not just the final answer
    retrieval = start_span(trace, name="policy_lookup")
    policy_docs = search_policy_docs(user_input)
    retrieval.score(
        name="retrieval_relevance",
        value=judge_retrieval(user_input, policy_docs),
    )

    # Score the routing decision the moment it is made
    planning = start_span(trace, name="tool_routing")
    action = choose_next_action(user_input, policy_docs)
    planning.score(
        name="tool_choice_correctness",
        value=judge_tool_choice(user_input, action),
    )

    # Validate tool arguments at execution time
    execution = start_span(trace, name="tool_execution")
    result = run_action(action)
    execution.score(
        name="tool_argument_quality",
        value=validate_tool_args(action),
    )

    # Only now judge the final answer itself
    response = start_span(trace, name="final_response")
    answer = draft_user_response(user_input, result)
    response.score(
        name="answer_helpfulness",
        value=judge_answer(user_input, answer),
    )

    # One trace-level score for overall path efficiency
    trace.score(name="workflow_efficiency", value=judge_efficiency(trace))
    return answer
The important design choice is this:
- Do not wait until the final answer to score quality.
- Score the steps that matter to your business risk.
In practice, the three platforms map to this pattern in different ways:
- LangWatch supports attaching evaluations to traces and spans, plus running experiments and agent simulations.
- LangSmith supports offline evaluators, online evaluators on traces, and agent trajectory evaluations.
- Langfuse supports scores on traces and observations, managed LLM-as-a-judge on production or development traces, and experiments on datasets.
Scenario Testing Still Matters More Than Many Teams Think
A trap I see often is this:
- Team adds traces
- Team adds a judge on the final answer
- Team declares evaluation solved
But production agents often fail in conversation, not just in isolated responses.
Examples:
- The agent should ask a clarifying question before booking a flight.
- The agent should refuse to call a destructive tool without confirmation.
- The agent should recover when a tool returns partial data.
- The agent should preserve context across multiple turns without repeating work.
Those are scenario problems. They are hard to reduce to single input-output pairs.
This is where LangWatch’s Scenario tooling is especially opinionated: it treats agent reliability as something you should rehearse through realistic interactions, not just score after the fact. LangSmith also supports strong agent evaluation through trajectory evaluators and single-step/final-response evaluators, but its framing in the docs is more evaluation-lifecycle-centric than simulation-first. Langfuse, based on the docs I reviewed, is strongest on trace-centric scoring and experiments rather than on a first-party scenario simulation workflow.
That last comparison sentence is also my editorial synthesis from the docs, not a vendor claim.
LangWatch vs LangSmith vs Langfuse
The best choice depends on which evaluation problem hurts you most.
| Tool | Where It Feels Strongest | Trace-Aware Grading | Scenario / Trajectory Testing | Human Review | Best Fit |
|---|---|---|---|---|---|
| LangWatch | One stack for observability, evals, and simulation-heavy agent testing | Strong trace/span evaluation model, online evals, alerts | Strongest scenario-first story of the three | Annotations supported | Teams that want simulation plus production grading in one workflow |
| LangSmith | Mature evaluation lifecycle from offline datasets to online evaluators | Strong online evaluators on traces and rich experiment workflows | Strong trajectory evaluation; good for agent path checks | Strong annotation queue workflow | Teams already using LangChain/LangGraph or wanting robust eval operations |
| Langfuse | Open-source observability, prompt management, and scoring around real traces | Strong scores on traces/observations and managed LLM-as-judge on production traces | Good offline experiments, but less scenario-centric in docs reviewed | Annotation queues supported | Teams that want open-source, self-hostable trace-centric evaluation |
My Practical Read
- Pick LangWatch if the missing piece is “we need better ways to rehearse and debug full agent behavior.”
- Pick LangSmith if the missing piece is “we need a disciplined evaluation system across offline, online, trajectories, and human review.”
- Pick Langfuse if the missing piece is “we want deep production tracing plus flexible scoring in an open-source stack.”
If I were advising a team from scratch, I would not ask only “Which tool has evals?” I would ask:
- Do we need simulations most?
- Do we need trajectory and evaluator operations most?
- Do we need trace-native, open-source production observability most?
That question usually makes the choice much clearer.
Where PMs and Engineers Should Align
This topic is not just for platform engineers.
PMs care about:
- Whether the agent completes the workflow users expect
- Whether regressions show up before support tickets pile up
- Whether quality can be explained in dashboards, not anecdotes
Engineers care about:
- Whether the tool-routing logic regressed
- Whether prompt changes increased retries or cost
- Whether low scores can be traced back to a specific step
Trace-aware grading gives both groups a better shared object:
- PMs get quality trends tied to user journeys
- Engineers get the exact spans, steps, and rubrics behind those scores
That is a much healthier operating model than arguing about one cherry-picked screenshot from staging.
Where to Start on Monday
If your team is early, do this:
- Pick one production workflow that matters, like refund handling or lead qualification.
- Define three or four span-level rubrics:
  - retrieval relevance
  - tool choice correctness
  - final answer helpfulness
  - workflow efficiency
- Add human review for low-score traces.
- Turn the worst real traces into offline datasets and scenario tests.
- Only then expand to more metrics.
If you try to evaluate everything at once, you will build dashboards nobody trusts.
If you start with one workflow and a small rubric, you can actually improve the system.
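The starter rubric from the Monday list can live in a small config that also decides when a trace goes to human review. The metric names echo the list above; the thresholds are illustrative defaults I chose, not recommendations.

```python
# Starter rubric with alert thresholds. Names and floors are illustrative.
RUBRIC = {
    "retrieval_relevance":     0.7,
    "tool_choice_correctness": 0.9,
    "answer_helpfulness":      0.7,
    "workflow_efficiency":     0.5,
}

def needs_human_review(trace_scores: dict) -> bool:
    """Route a trace to annotation when any rubric metric is below its floor.
    Missing metrics count as 0.0, so unscored traces also get reviewed."""
    return any(trace_scores.get(name, 0.0) < floor
               for name, floor in RUBRIC.items())

assert needs_human_review({"retrieval_relevance": 0.4,
                           "tool_choice_correctness": 1.0,
                           "answer_helpfulness": 0.9,
                           "workflow_efficiency": 0.8})
```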
Final Take
Output-only evaluation made sense when many LLM apps were basically smart text functions.
Agents changed the game. Once a system can plan, call tools, branch, retry, and maintain state, the final answer stops being the whole story. The execution path becomes part of the product.
That is why I expect more production teams to treat trace grading as a standard layer of agent quality, while keeping scenario testing as the rehearsal layer that protects releases.
The real choice is not trace grading or scenario testing.
The real production move is:
- use scenario testing to practice behavior before release
- use trace-aware grading to control behavior after release
Sources
- LangWatch Docs: Better Agents Overview
- LangWatch Docs: Agent Simulations Introduction
- LangWatch Docs: Observability & Tracing
- LangWatch Docs: Custom Scoring
- LangSmith Docs: Evaluation Concepts
- LangSmith Docs: Evaluation Types
- LangSmith Docs: Online Evaluations
- LangSmith Docs: Trajectory Evaluations
- LangSmith Docs: Annotation Queues
- Langfuse Docs: Overview
- Langfuse Docs: Observability Overview
- Langfuse Docs: Datasets from Traces and Observations
- Langfuse Docs: Metrics Overview
- Langfuse Handbook: Why Langfuse Is Open Source