You shipped an AI agent. It handles customer requests, routes tickets, maybe even drafts responses. In staging it looked great. In production, things get murkier. A user reports a wrong answer. Another says the tone felt off. Your team’s response? “It seemed fine when we tested it.”
This is the vibes-based evaluation trap, and most teams building with LLMs fall into it. The fix is systematic evaluation: running your agent against curated inputs, scoring outputs with measurable criteria, and tracking those scores over time. LangWatch is a platform built specifically for this.
This article walks through how LangWatch evaluations work using a fictional application — TravelBot, an AI-powered travel booking assistant — and explains why each layer of the evaluation stack matters.
The Problem: AI Agents Are Not Deterministic Functions
Traditional software testing relies on determinism: given input X, expect output Y. AI agents break this contract. The same prompt can produce different outputs across runs, models, or even temperature settings. A support agent might correctly answer “How do I cancel my booking?” ten times, then hallucinate a refund policy on the eleventh.
Industry research suggests that systematic evaluation can reduce production failures by up to 60% while accelerating deployment cycles. Yet according to recent industry surveys, only 52% of teams run offline evaluations and just 37% evaluate in production. Quality remains the number one blocker for agent deployments.
The question shifts from “does this pass?” to “how good is this, on average, across many examples?”
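Concretely, that means replacing a single assertion with an aggregate score over a set of examples. A minimal sketch, using a deliberately naive substring check as a stand-in for a real evaluator:

```python
from statistics import mean

def score_output(output: str, expected: str) -> float:
    """Toy evaluator: 1.0 if the expected phrase appears in the output."""
    return 1.0 if expected.lower() in output.lower() else 0.0

# Instead of asserting one output, average scores across many examples.
examples = [
    ("Your booking to Tokyo is confirmed.", "tokyo"),
    ("I cannot find booking BK-456.", "bk-456"),
    ("Please contact support.", "refund"),  # a miss: drags the average down
]
scores = [score_output(output, expected) for output, expected in examples]
print(f"average quality: {mean(scores):.2f}")  # average quality: 0.67
```

The single number is less informative than the per-example breakdown, but tracked over time it becomes a trend line.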
Meet TravelBot: Our Example Application
TravelBot is a Python-based AI assistant that helps users book flights, hotels, and rental cars. It has multiple specialized agents:
- BookingAgent — processes reservation requests
- SupportAgent — handles cancellations, changes, and complaints
- RecommendationAgent — suggests destinations based on preferences
- PolicyAgent — answers questions about refund policies, baggage rules, etc.
Each agent receives structured input (user profile, booking history, current request), calls external tools (flight search API, hotel availability, payment gateway), and returns a structured response with actions taken and follow-up items.
The codebase already has two testing layers:
| Layer | What It Tests |
|---|---|
| Unit tests | Deterministic logic: input parsing, API response formatting, error handling |
| Scenario tests | Full agent behavior with mocked tools: multi-turn conversations, edge cases |
What’s missing is batch quality measurement with scoring — running the agent against dozens or hundreds of examples and producing numerical scores you can track across prompt changes, model upgrades, and code refactors.
The Testing Pyramid for AI Applications
The AI agent testing pyramid adapts the classic software testing pyramid for non-deterministic systems:
```
          ╱ Scenario Tests ╲           Few, expensive (real LLM calls)
         ╱   Multi-turn +   ╲          Binary pass/fail via judge agent
        ╱    judge agents    ╲
       ╱──────────────────────╲
      ╱    Batch Evaluations   ╲       Many examples, scored numerically
     ╱  (LangWatch experiments) ╲      Track quality over time
    ╱────────────────────────────╲
   ╱          Unit Tests          ╲    Fast, deterministic, no LLM calls
  ╱          No API costs          ╲
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
Scenario tests sit at the top: they’re realistic but expensive and produce binary results (pass or fail). Unit tests form the base: fast, cheap, deterministic. Batch evaluations occupy the middle — they make real LLM calls like scenarios, but score outputs numerically across many examples, giving you a quality trend line instead of a single verdict.
This middle layer is where LangWatch excels.
How LangWatch Evaluations Work
LangWatch provides four interconnected tools for evaluation:
1. Datasets — Your Curated Test Inputs
A dataset is a collection of inputs with optional expected outputs. For TravelBot, a good evaluation dataset covers edge cases that matter:
| Scenario | What It Tests |
|---|---|
| User requests a valid round-trip flight | Happy path — agent books correctly |
| User asks to cancel a non-existent booking | Error handling — agent responds gracefully |
| User provides incomplete destination info | Clarification — agent asks follow-up questions |
| User requests a refund outside the policy window | Policy adherence — agent denies correctly |
| User sends the same request twice | Idempotency — agent doesn’t double-book |
| User switches languages mid-conversation | Robustness — agent maintains context |
LangWatch supports importing datasets from CSV, generating them synthetically from documents, and continuously populating them from production traces (real user interactions that you flag for inclusion).
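As a sketch of the CSV route, a dataset can be loaded into a pandas DataFrame before being handed to an experiment. The columns and rows here are hypothetical TravelBot cases, inlined instead of read from disk:

```python
import io

import pandas as pd

# Hypothetical CSV export — in practice this would be a file on disk.
csv_data = """input,expected_action
"{""request"": ""Book a flight to Tokyo""}",book_flight
"{""request"": ""Cancel booking BK-456""}",cancel_booking
"""

dataset = pd.read_csv(io.StringIO(csv_data))
print(dataset["expected_action"].tolist())  # ['book_flight', 'cancel_booking']
```

The same DataFrame shape works whether the rows come from a CSV export, synthetic generation, or flagged production traces.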
2. Evaluators — How You Score Outputs
Evaluators are functions that take an agent’s output and produce a score. LangWatch provides several categories:
Built-in evaluators for common patterns:
- `ragas/answer_relevancy` — is the response relevant to the question?
- `ragas/faithfulness` — does the response stick to provided context?
- Exact match, BLEU score, ROUGE score for comparing against expected text
LLM-as-Judge evaluators that use an LLM to score another LLM’s output:
- Boolean (pass/fail) — “Did the agent correctly identify the user’s intent?”
- Score (0–1 numeric) — “How helpful was this response on a scale?”
- Category — “Classify this response as: correct, partially correct, or incorrect”
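Under the hood, a boolean LLM-as-Judge is essentially a grading prompt plus a parser for the model's verdict. A minimal sketch — the prompt wording and both helper functions are illustrative, not LangWatch internals, and the actual judge-model call is omitted:

```python
def build_judge_prompt(question: str, response: str) -> str:
    """Compose a pass/fail grading prompt for a judge model."""
    return (
        "You are grading an AI travel assistant.\n"
        f"User question: {question}\n"
        f"Assistant response: {response}\n"
        "Did the assistant correctly identify the user's intent? "
        "Answer with exactly PASS or FAIL."
    )

def parse_verdict(raw: str) -> bool:
    """Map the judge model's text output back to a boolean score."""
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"unexpected judge output: {raw!r}")
    return verdict == "PASS"

# Any chat-completion client slots in between these two helpers:
# raw = client.complete(build_judge_prompt(q, r)); passed = parse_verdict(raw)
```

Constraining the judge to a fixed vocabulary (PASS/FAIL, a 0–1 number, a category name) is what makes its output machine-scorable.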
Custom evaluators for domain-specific logic. This is where structured output evaluation shines. For TravelBot, a custom evaluator might look like:
```python
import json

def evaluate_booking_correctness(output_json: str, expected: dict) -> float:
    """Score 0-1 based on booking field accuracy."""
    actual = json.loads(output_json)
    fields = ["destination", "departure_date", "return_date", "passengers"]
    correct = sum(1 for f in fields if actual.get(f) == expected.get(f))
    return correct / len(fields)
```
This is critical for agents that produce structured outputs (Pydantic models, JSON schemas) rather than free text. Generic text-similarity evaluators like BLEU don’t work well for structured data — you need field-level comparison.
3. Experiments — Tying It All Together
An experiment runs your agent against a dataset, scores each output with one or more evaluators, and logs the results. Here’s the core pattern with LangWatch’s Python SDK:
```python
import langwatch
import pandas as pd

dataset = pd.DataFrame({
    "input": [
        '{"request": "Book a flight to Tokyo", "user_id": "u123"}',
        '{"request": "Cancel booking BK-456", "user_id": "u789"}',
    ],
    "expected_destination": ["Tokyo", None],
    "expected_action": ["book_flight", "cancel_booking"],
})

async def run_experiment():
    experiment = langwatch.experiment.init("support-agent-v2")

    for index, row in experiment.loop(dataset.iterrows()):
        response = await support_agent.run(row["input"])
        output = response.model_dump_json()

        # Custom score computed locally from the structured response
        experiment.log(
            "action_correctness",
            index=index,
            score=1.0 if response.action == row["expected_action"] else 0.0,
        )
        # Built-in evaluator run by LangWatch
        experiment.evaluate(
            "ragas/answer_relevancy",
            index=index,
            data={"input": row["input"], "output": output},
            settings={"model": "openai/gpt-4o-mini"},
        )
```
The key difference from scenario tests: you get continuous scores (0.87 relevancy, 0.93 action correctness) instead of binary pass/fail. When you change a prompt or swap models, you can compare scores across runs to detect regressions.
4. Monitors — Production Observability
Once your agent is in production, monitors continuously score live traffic:
User Request → Agent Response → LangWatch Trace → Monitor → Score → Dashboard
Monitors run asynchronously after the response is sent — they don’t add latency to the user experience. They feed dashboards and trigger alerts when quality drops below thresholds. This closes the loop: evaluation doesn’t end at deployment, it continues through the agent’s lifetime.
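Conceptually, the alerting side of a monitor is a rolling score compared against a threshold. A toy sketch (the window size and threshold here are arbitrary, and LangWatch handles this server-side rather than in your application code):

```python
from collections import deque
from statistics import mean

class QualityMonitor:
    """Track a rolling mean of evaluation scores and flag quality drops."""

    def __init__(self, window: int = 50, threshold: float = 0.8):
        self.scores: deque = deque(maxlen=window)
        self.threshold = threshold

    def record(self, score: float) -> bool:
        """Record a score; return True if an alert should fire."""
        self.scores.append(score)
        return mean(self.scores) < self.threshold

monitor = QualityMonitor(window=5, threshold=0.8)
alerts = [monitor.record(s) for s in [1.0, 1.0, 0.9, 0.4, 0.3]]
print(alerts)  # [False, False, False, False, True]
```

The rolling mean only crosses the threshold on the last score, so one bad response doesn't page anyone, but a sustained drop does.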
Pattern: Evaluating the Decision, Not the Integration
A common mistake is trying to evaluate everything at once: the LLM’s decision-making AND the external API calls. For TravelBot, you don’t want your evaluation to fail because the flight search API is down. You want to know if the agent decided correctly — did it pick the right tool, with the right parameters, for the right reason?
```
              What Evaluations Should Test
              ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Input            LLM Decision            Tool Call          External API
 (User request) ► (Which action?        ► (search_flights) ► (Amadeus API)
                   Which params?           (book_hotel)       (Stripe API)
                   Ask for more info?)

                         ◄── EVALUATION BOUNDARY ──►
                   Evaluations test           Unit/integration
                   this side                  tests cover this side
```
Mock your external tools during evaluation. Test the agent’s reasoning in isolation. This is the same principle behind the traditional testing pyramid — evaluate each layer independently.
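In Python, isolating the decision usually means stubbing the tool functions during evaluation. A sketch using `unittest.mock`, where `choose_action` and the `search_flights` tool are hypothetical stand-ins for TravelBot's real code:

```python
from unittest.mock import MagicMock

def choose_action(request: str, search_flights) -> dict:
    """Toy agent step: decide which tool to call and with what parameters."""
    if "book" in request.lower():
        results = search_flights(destination="Tokyo")
        return {"action": "book_flight", "options": results}
    return {"action": "clarify"}

# During evaluation, the real flight API is replaced with a deterministic
# stub, so the score reflects the decision, not the API's availability.
fake_search = MagicMock(return_value=[{"flight": "NH-110", "price": 820}])
decision = choose_action("Please book me a flight", fake_search)

print(decision["action"])  # book_flight
fake_search.assert_called_once_with(destination="Tokyo")
```

The mock also records how the tool was called, so the evaluation can score parameter choice, not just the final answer.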
Pattern: Scoring Structured Outputs
When agents return structured data (JSON, Pydantic models) instead of free text, generic evaluators fall short. You need evaluators that understand your schema.
For TravelBot’s booking response:
```python
from pydantic import BaseModel

class DateRange(BaseModel):  # simplified stand-in for the real model
    start: str
    end: str

class BookingResponse(BaseModel):
    action: str                # "book_flight", "cancel", "modify"
    destination: str | None
    dates: DateRange | None
    passengers: int
    follow_ups: list[str]      # items requiring human attention
```
A comprehensive evaluation covers multiple dimensions:
| Dimension | What to Measure | Evaluator Type |
|---|---|---|
| Action correctness | Did the agent choose the right action? | Custom (exact match) |
| Field accuracy | Are destination, dates, passengers correct? | Custom (field comparison) |
| Schema compliance | Does the output conform to the expected structure? | Custom (Pydantic validation) |
| Follow-up quality | Are follow-up items clear and actionable? | LLM-as-Judge (score) |
| Idempotency | Does the agent avoid duplicate actions? | Custom (state comparison) |
Pattern: Comparing Models and Prompts
One of LangWatch’s strengths is running the same dataset against different configurations and comparing results side by side. For TravelBot, you might compare:
- Model A (GPT-4o) vs. Model B (Claude Sonnet) on the same 50 booking requests
- Prompt v1 (detailed instructions) vs. Prompt v2 (concise instructions) on action correctness
- With guardrails vs. without guardrails on safety scores
The experiment dashboard shows score distributions, per-example breakdowns, and statistical comparisons. This turns model selection and prompt engineering from guesswork into data-driven decisions.
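Even before reaching the dashboard, the underlying comparison is simple arithmetic over two score columns from the same dataset. A toy sketch with invented scores:

```python
from statistics import mean, stdev

# Hypothetical per-example relevancy scores from two runs on the same 10 inputs.
scores_gpt4o = [0.91, 0.88, 0.95, 0.80, 0.90, 0.87, 0.93, 0.85, 0.89, 0.92]
scores_sonnet = [0.90, 0.92, 0.94, 0.86, 0.91, 0.89, 0.95, 0.88, 0.90, 0.93]

for name, scores in [("gpt-4o", scores_gpt4o), ("sonnet", scores_sonnet)]:
    print(f"{name}: mean={mean(scores):.3f} stdev={stdev(scores):.3f}")

# Paired per-example deltas show where one configuration wins or loses.
deltas = [b - a for a, b in zip(scores_gpt4o, scores_sonnet)]
print(f"sonnet better on {sum(d > 0 for d in deltas)}/10 examples")
```

The paired view matters: two configurations with similar means can still fail on completely different examples.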
Cost and Scalability Considerations
Every evaluation run makes LLM calls for both the agent under test and the evaluator. With multiple agents and large datasets, costs add up quickly.
Practical strategies:
- Use a cheaper model for evaluation judgments. If your agent runs on GPT-4o, use GPT-4o-mini or Claude Haiku for the LLM-as-Judge evaluator.
- Run evaluations selectively. Only evaluate agents whose prompts or code changed in a given PR.
- Cache agent responses. When iterating on evaluator logic, reuse cached outputs instead of re-running the agent.
- Start small. Begin with 20–50 dataset rows covering critical edge cases. Expand as you identify gaps from production monitoring.
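Caching can be as simple as keying agent outputs by a stable hash of the input. A minimal sketch — `cached_run` and `fake_agent` are illustrative, and this assumes reusing a stale output is acceptable while you iterate on evaluator logic:

```python
import hashlib
import json

_cache: dict = {}

def cached_run(agent_fn, input_payload: dict) -> str:
    """Memoize agent outputs by a stable hash of the input payload."""
    key = hashlib.sha256(
        json.dumps(input_payload, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = agent_fn(input_payload)
    return _cache[key]

calls = []
def fake_agent(payload: dict) -> str:
    calls.append(payload)  # count real invocations
    return f"handled {payload['request']}"

cached_run(fake_agent, {"request": "Book a flight to Tokyo"})
cached_run(fake_agent, {"request": "Book a flight to Tokyo"})  # cache hit
print(len(calls))  # 1 — the agent only ran once
```

Remember to invalidate the cache when the agent's prompt or model changes, or the evaluation will score stale behavior.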
When to Use What
The three testing layers serve different purposes. Choosing the right one depends on what you’re trying to learn:
| | Unit Tests | Batch Evaluations | Scenario Tests |
|---|---|---|---|
| Speed | Milliseconds | Minutes | Minutes |
| LLM calls | None | Many | Few |
| Result type | Pass/fail | Numeric score | Pass/fail |
| Best for | Deterministic logic | Quality tracking | Realistic conversations |
| Run frequency | Every commit | Per PR / nightly | Nightly / weekly |
| Example question | "Does input parsing work?" | "Across 50 inputs, how accurate is the agent?" | "Can the agent handle a multi-turn cancellation flow?" |
The highest-value pattern: use production monitors to find failures, add those cases to your evaluation dataset, and track whether your fixes actually improve scores.
Getting Started
If you’re building AI agents and don’t have batch evaluations yet, here’s a practical starting path:
Phase 1: Foundation
- Install `langwatch` and set your API key
- Create a single experiment for your most critical agent
- Start with 20 curated inputs covering known edge cases
- Use one built-in evaluator plus one custom evaluator
Phase 2: Coverage
- Expand datasets from production traces (real failures)
- Add experiments for remaining agents
- Integrate into CI — fail PRs if scores drop below threshold
- Build custom evaluators for structured output fields
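The CI-gate item above can be a short script that compares the latest experiment score against a baseline and fails the job on regression. The threshold, tolerance, and inline scores here are illustrative:

```python
def ci_gate(current_score: float, baseline: float, tolerance: float = 0.02) -> int:
    """Return a process exit code: 0 if quality held, 1 if it regressed."""
    if current_score < baseline - tolerance:
        print(f"FAIL: score {current_score:.2f} dropped below "
              f"baseline {baseline:.2f} (tolerance {tolerance})")
        return 1
    print(f"OK: score {current_score:.2f} vs baseline {baseline:.2f}")
    return 0

# In CI this would read the latest experiment results; values are inlined here.
exit_code = ci_gate(current_score=0.84, baseline=0.90)
print(exit_code)  # 1 — this PR would be blocked
# sys.exit(exit_code) would then end the job with that status
```

The tolerance matters: without it, normal run-to-run score noise from non-deterministic outputs would fail healthy PRs.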
Phase 3: Production Loop
- Enable LangWatch tracing in production
- Set up monitors with alerting
- Automatically populate datasets from low-scoring traces
- Track quality trends across deployments
The key insight is that evaluation is not a one-time activity. It’s a continuous loop: measure, identify gaps, fix, re-measure. LangWatch’s combination of experiments, evaluators, datasets, and monitors provides the infrastructure for that loop.
Sources
- LangWatch Documentation — Evaluations Overview: Core concepts for experiments, evaluators, datasets, and monitors. langwatch.ai/docs/evaluations/overview
- LangWatch — Experiments via SDK: Python SDK reference for running batch experiments programmatically. langwatch.ai/docs/evaluations/experiments/sdk
- LangWatch — Evaluating Structured Data Extraction: Guide for evaluating agents that produce structured outputs like JSON and Pydantic models. docs.langwatch.ai/use-cases/structured-outputs
- LangWatch — Online Evaluation Overview: How monitors work for production quality tracking. docs.langwatch.ai/llm-evaluation/real-time-evaluation
- LangWatch — List of Evaluators: Complete catalog of built-in evaluators including RAGAS, LLM-as-Judge, and custom evaluators. langwatch.ai/docs/evaluations/evaluators/list
- Adaline — The Complete Guide to LLM & AI Agent Evaluation in 2026: Comprehensive overview of prompt-level, RAG, and agent evaluation strategies. adaline.ai/blog/complete-guide-llm-ai-agent-evaluation-2026
- Orchestrator.dev — AI Evaluation: Tools, Techniques, and Best Practices for 2026: Research showing systematic evaluation reduces production failures by up to 60%. orchestrator.dev/blog/2026-02-18-ai-evaluation-guide-2026
- Derek C. Ashmore — The AI Agent Testing Pyramid: A practical framework adapting the testing pyramid for non-deterministic AI systems. medium.com/@derekcashmore/the-ai-agent-testing-pyramid
- Kevin Tan — How to Test AI Agents Before They Break Production: Industry data showing only 52% of teams run offline evals and 37% evaluate in production. blog.jztan.com/testing-ai-agents-in-production
- Zylos Research — AI Agent Testing & Evaluation: The Complete 2026 Guide: The CLASSic framework (Cost, Latency, Accuracy, Stability, Security) for agent evaluation. zylos.ai/research/2026-01-12-ai-agent-testing-evaluation