AI Quality & Evaluation

Evaluating AI Agents with LangWatch: From Vibes to Scores

Unit tests tell you if your code works. Scenario tests tell you if your agent behaves. But how do you measure quality across hundreds of examples and track it over time? LangWatch evaluations fill that gap.

11 min read Updated Mar 19, 2026

You shipped an AI agent. It handles customer requests, routes tickets, maybe even drafts responses. In staging it looked great. In production, things get murkier. A user reports a wrong answer. Another says the tone felt off. Your team’s response? “It seemed fine when we tested it.”

This is the vibes-based evaluation trap, and most teams building with LLMs fall into it. The fix is systematic evaluation: running your agent against curated inputs, scoring outputs with measurable criteria, and tracking those scores over time. LangWatch is a platform built specifically for this.

This article walks through how LangWatch evaluations work using a fictional application — TravelBot, an AI-powered travel booking assistant — and explains why each layer of the evaluation stack matters.

The Problem: AI Agents Are Not Deterministic Functions

Traditional software testing relies on determinism: given input X, expect output Y. AI agents break this contract. The same prompt can produce different outputs across runs, models, or even temperature settings. A support agent might correctly answer “How do I cancel my booking?” ten times, then hallucinate a refund policy on the eleventh.

Research from Stanford and MIT suggests that systematic evaluation reduces production failures by up to 60% while accelerating deployment cycles. Yet according to recent industry surveys, only 52% of teams run offline evaluations and just 37% evaluate in production. Quality remains the number one blocker for agent deployments.

The question shifts from “does this pass?” to “how good is this, on average, across many examples?”

Meet TravelBot: Our Example Application

TravelBot is a Python-based AI assistant that helps users book flights, hotels, and rental cars. It has multiple specialized agents:

  • BookingAgent — processes reservation requests
  • SupportAgent — handles cancellations, changes, and complaints
  • RecommendationAgent — suggests destinations based on preferences
  • PolicyAgent — answers questions about refund policies, baggage rules, etc.

Each agent receives structured input (user profile, booking history, current request), calls external tools (flight search API, hotel availability, payment gateway), and returns a structured response with actions taken and follow-up items.

The codebase already has two testing layers:

Layer            What It Tests
Unit tests       Deterministic logic: input parsing, API response formatting, error handling
Scenario tests   Full agent behavior with mocked tools: multi-turn conversations, edge cases

What’s missing is batch quality measurement with scoring — running the agent against dozens or hundreds of examples and producing numerical scores you can track across prompt changes, model upgrades, and code refactors.

The Testing Pyramid for AI Applications

The AI agent testing pyramid adapts the classic software testing pyramid for non-deterministic systems:

         ╱  Scenario Tests  ╲           Few, expensive (real LLM calls)
        ╱    Multi-turn +     ╲         Binary pass/fail via judge agent
       ╱     judge agents      ╲
      ╱─────────────────────────╲
     ╱   Batch Evaluations       ╲     Many examples, scored numerically
    ╱    (LangWatch experiments)  ╲    Track quality over time
   ╱───────────────────────────────╲
  ╱         Unit Tests              ╲  Fast, deterministic, no LLM calls
 ╱          No API costs             ╲
╱━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╲

Scenario tests sit at the top: they’re realistic but expensive and produce binary results (pass or fail). Unit tests form the base: fast, cheap, deterministic. Batch evaluations occupy the middle — they make real LLM calls like scenarios, but score outputs numerically across many examples, giving you a quality trend line instead of a single verdict.

This middle layer is where LangWatch excels.

How LangWatch Evaluations Work

LangWatch provides four interconnected tools for evaluation:

1. Datasets — Your Curated Test Inputs

A dataset is a collection of inputs with optional expected outputs. For TravelBot, a good evaluation dataset covers edge cases that matter:

Scenario                                            What It Tests
User requests a valid round-trip flight             Happy path — agent books correctly
User asks to cancel a non-existent booking          Error handling — agent responds gracefully
User provides incomplete destination info           Clarification — agent asks follow-up questions
User requests a refund outside the policy window    Policy adherence — agent denies correctly
User sends the same request twice                   Idempotency — agent doesn’t double-book
User switches languages mid-conversation            Robustness — agent maintains context

LangWatch supports importing datasets from CSV, generating them synthetically from documents, and continuously populating them from production traces (real user interactions that you flag for inclusion).

2. Evaluators — How You Score Outputs

Evaluators are functions that take an agent’s output and produce a score. LangWatch provides several categories:

Built-in evaluators for common patterns:

  • ragas/answer_relevancy — is the response relevant to the question?
  • ragas/faithfulness — does the response stick to provided context?
  • Exact match, BLEU score, ROUGE score for comparing against expected text

LLM-as-Judge evaluators that use an LLM to score another LLM’s output:

  • Boolean (pass/fail) — “Did the agent correctly identify the user’s intent?”
  • Score (0–1 numeric) — “How helpful was this response on a scale?”
  • Category — “Classify this response as: correct, partially correct, or incorrect”
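However the judge is configured, it ultimately reduces to prompting a model and parsing its reply into a machine-readable result. A minimal sketch of the parsing side, assuming a hypothetical reply convention of the form `VERDICT: PASS, SCORE: 0.8` (the format and the `parse_judge_verdict` helper are illustrative, not LangWatch APIs):

```python
import re

def parse_judge_verdict(raw: str) -> tuple[bool, float]:
    """Parse a judge LLM's reply into (passed, score).

    Assumes the judge prompt instructed the model to answer in the
    form 'VERDICT: PASS|FAIL, SCORE: <0-1>' so the reply is parseable.
    """
    verdict_match = re.search(r"VERDICT:\s*(PASS|FAIL)", raw, re.IGNORECASE)
    score_match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", raw)

    passed = bool(verdict_match) and verdict_match.group(1).upper() == "PASS"
    # Fall back to a binary score when the judge omits the numeric part.
    score = float(score_match.group(1)) if score_match else (1.0 if passed else 0.0)
    return passed, score

print(parse_judge_verdict("VERDICT: PASS, SCORE: 0.8"))  # (True, 0.8)
```

Constraining the judge to a rigid answer format is what makes its output usable as a score rather than another blob of free text.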

Custom evaluators for domain-specific logic. This is where structured output evaluation shines. For TravelBot, a custom evaluator might look like:

import json

def evaluate_booking_correctness(output_json: str, expected: dict) -> float:
    """Score 0-1 based on booking field accuracy."""
    actual = json.loads(output_json)

    fields = ["destination", "departure_date", "return_date", "passengers"]
    correct = sum(1 for f in fields if actual.get(f) == expected.get(f))

    return correct / len(fields)

This is critical for agents that produce structured outputs (Pydantic models, JSON schemas) rather than free text. Generic text-similarity evaluators like BLEU don’t work well for structured data — you need field-level comparison.
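As a quick sanity check, here is that evaluator applied to a sample output where the agent gets three of four fields right. The booking data is made up for illustration; the function is repeated so the snippet runs on its own:

```python
import json

def evaluate_booking_correctness(output_json: str, expected: dict) -> float:
    """Score 0-1 based on booking field accuracy (same logic as above)."""
    actual = json.loads(output_json)
    fields = ["destination", "departure_date", "return_date", "passengers"]
    correct = sum(1 for f in fields if actual.get(f) == expected.get(f))
    return correct / len(fields)

output = json.dumps({
    "destination": "Tokyo",
    "departure_date": "2026-04-01",
    "return_date": "2026-04-09",   # agent got the return date wrong
    "passengers": 2,
})
expected = {
    "destination": "Tokyo",
    "departure_date": "2026-04-01",
    "return_date": "2026-04-10",
    "passengers": 2,
}

print(evaluate_booking_correctness(output, expected))  # 0.75
```

A partial-credit score like 0.75 is exactly what binary pass/fail tests cannot express: the agent was mostly right, and you can track whether "mostly" trends toward "fully" across runs.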

3. Experiments — Tying It All Together

An experiment runs your agent against a dataset, scores each output with one or more evaluators, and logs the results. Here’s the core pattern with LangWatch’s Python SDK:

import asyncio

import langwatch
import pandas as pd

dataset = pd.DataFrame({
    "input": [
        '{"request": "Book a flight to Tokyo", "user_id": "u123"}',
        '{"request": "Cancel booking BK-456", "user_id": "u789"}',
    ],
    "expected_destination": ["Tokyo", None],
    "expected_action": ["book_flight", "cancel_booking"],
})

async def main():
    experiment = langwatch.experiment.init("support-agent-v2")

    for index, row in experiment.loop(dataset.iterrows()):
        # support_agent is the TravelBot agent under test, defined elsewhere
        response = await support_agent.run(row["input"])
        output = response.model_dump_json()

        # Custom score computed locally: did the agent pick the right action?
        experiment.log(
            "action_correctness",
            index=index,
            score=1.0 if response.action == row["expected_action"] else 0.0,
        )

        # Built-in evaluator run by LangWatch, judged with a cheaper model
        experiment.evaluate(
            "ragas/answer_relevancy",
            index=index,
            data={"input": row["input"], "output": output},
            settings={"model": "openai/gpt-4o-mini"},
        )

asyncio.run(main())

The key difference from scenario tests: you get continuous scores (0.87 relevancy, 0.93 action correctness) instead of binary pass/fail. When you change a prompt or swap models, you can compare scores across runs to detect regressions.
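Detecting a regression between two runs can be as simple as comparing mean scores with a tolerance for run-to-run noise. A minimal sketch (the `detect_regression` helper and the 0.02 tolerance are illustrative choices, not part of LangWatch, whose dashboard does this comparison for you):

```python
from statistics import mean

def detect_regression(baseline: list[float], candidate: list[float],
                      tolerance: float = 0.02) -> bool:
    """Return True if the candidate run's mean score dropped more than
    `tolerance` below the baseline run's mean score.

    The tolerance absorbs normal run-to-run variance from
    non-deterministic outputs.
    """
    return mean(candidate) < mean(baseline) - tolerance

baseline_scores = [0.90, 0.85, 0.95, 0.90]   # mean 0.90: prompt v1
candidate_scores = [0.80, 0.75, 0.85, 0.80]  # mean 0.80: prompt v2

print(detect_regression(baseline_scores, candidate_scores))  # True
```

The same comparison gates a CI pipeline: fail the PR when the candidate run regresses past the tolerance.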

4. Monitors — Production Observability

Once your agent is in production, monitors continuously score live traffic:

User Request → Agent Response → LangWatch Trace → Monitor → Score → Dashboard

Monitors run asynchronously after the response is sent — they don’t add latency to the user experience. They feed dashboards and trigger alerts when quality drops below thresholds. This closes the loop: evaluation doesn’t end at deployment, it continues through the agent’s lifetime.
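The fire-and-forget shape of a monitor can be sketched with plain asyncio. Everything here is a conceptual stand-in (real LangWatch monitors run server-side against ingested traces); the point is only that the user response returns before any scoring work happens:

```python
import asyncio

async def score_trace(trace: dict) -> None:
    """Stand-in for a monitor evaluation running in the background."""
    await asyncio.sleep(0)  # simulate async evaluator work
    trace["score"] = 1.0 if "booked" in trace["output"] else 0.0

async def handle_request(request: str) -> str:
    response = f"Flight booked for: {request}"    # stand-in for the agent call
    trace = {"input": request, "output": response}
    asyncio.create_task(score_trace(trace))       # scheduled, not awaited
    return response                               # user never waits on scoring

print(asyncio.run(handle_request("Tokyo, 2 passengers")))
```

In a long-running server the event loop keeps processing the scheduled scoring tasks after each response is sent, which is why monitors add no user-facing latency.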

Pattern: Evaluating the Decision, Not the Integration

A common mistake is trying to evaluate everything at once: the LLM’s decision-making AND the external API calls. For TravelBot, you don’t want your evaluation to fail because the flight search API is down. You want to know if the agent decided correctly — did it pick the right tool, with the right parameters, for the right reason?

What Evaluations Should Test
━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Input              LLM Decision           Tool Call         External API
(User request) ►  (Which action?    ►   (search_flights)  ► (Amadeus API)
                   Which params?         (book_hotel)        (Stripe API)
                   Ask for more info?)

                ◄── EVALUATION BOUNDARY ──►
                    Evaluations test        Unit/integration
                    this side               tests cover this side

Mock your external tools during evaluation. Test the agent’s reasoning in isolation. This is the same principle behind the traditional testing pyramid — evaluate each layer independently.
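With Python's `unittest.mock`, isolating the decision from the integration is a few lines. A minimal sketch, assuming a hypothetical `search_flights` tool and a stand-in for the LLM decision step:

```python
from unittest.mock import Mock

# Hypothetical tool interface; the real TravelBot tools would differ.
flight_api = Mock()
flight_api.search_flights.return_value = [
    {"flight": "NH106", "destination": "Tokyo", "price": 850},
]

def run_agent_decision(request: str, tools) -> dict:
    """Stand-in for the LLM decision step: pick a tool and parameters."""
    if "flight" in request.lower():
        results = tools.search_flights(destination="Tokyo")
        return {"action": "book_flight", "options": results}
    return {"action": "clarify"}

decision = run_agent_decision("Book a flight to Tokyo", flight_api)

# Evaluate the decision, not the integration: right tool, right parameters.
assert decision["action"] == "book_flight"
flight_api.search_flights.assert_called_once_with(destination="Tokyo")
```

Because the mock records every call, the evaluator can assert not just what the agent answered but which tool it reached for and with what arguments, even when the real API is down.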

Pattern: Scoring Structured Outputs

When agents return structured data (JSON, Pydantic models) instead of free text, generic evaluators fall short. You need evaluators that understand your schema.

For TravelBot’s booking response:

from pydantic import BaseModel

class DateRange(BaseModel):
    start: str  # minimal stand-in for the real date-range type
    end: str

class BookingResponse(BaseModel):
    action: str            # "book_flight", "cancel", "modify"
    destination: str | None
    dates: DateRange | None
    passengers: int
    follow_ups: list[str]  # items requiring human attention

A comprehensive evaluation covers multiple dimensions:

Dimension           What to Measure                                     Evaluator Type
Action correctness  Did the agent choose the right action?              Custom (exact match)
Field accuracy      Are destination, dates, passengers correct?         Custom (field comparison)
Schema compliance   Does the output conform to the expected structure?  Custom (Pydantic validation)
Follow-up quality   Are follow-up items clear and actionable?           LLM-as-Judge (score)
Idempotency         Does the agent avoid duplicate actions?             Custom (state comparison)
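The schema-compliance dimension reduces to a try/except around parsing plus type checks. Here is a dependency-free sketch using stdlib `json` (with Pydantic, the equivalent check would wrap `BookingResponse.model_validate_json` in a try/except; the field list below is a simplified subset for illustration):

```python
import json

# Required fields and their expected types, mirroring a subset of BookingResponse.
SCHEMA = {"action": str, "passengers": int, "follow_ups": list}

def schema_compliance(output_json: str) -> float:
    """Return 1.0 if the output parses as JSON and every required field
    has the right type, else 0.0."""
    try:
        data = json.loads(output_json)
    except json.JSONDecodeError:
        return 0.0
    ok = all(isinstance(data.get(field), t) for field, t in SCHEMA.items())
    return 1.0 if ok else 0.0

print(schema_compliance('{"action": "book_flight", "passengers": 2, "follow_ups": []}'))  # 1.0
print(schema_compliance('{"action": "book_flight"}'))  # 0.0 (missing fields)
```

Tracked over time, this score catches a class of regression that text-similarity metrics miss entirely: the model starts emitting prose where the caller expects JSON.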

Pattern: Comparing Models and Prompts

One of LangWatch’s strengths is running the same dataset against different configurations and comparing results side by side. For TravelBot, you might compare:

  • Model A (GPT-4o) vs. Model B (Claude Sonnet) on the same 50 booking requests
  • Prompt v1 (detailed instructions) vs. Prompt v2 (concise instructions) on action correctness
  • With guardrails vs. without guardrails on safety scores

The experiment dashboard shows score distributions, per-example breakdowns, and statistical comparisons. This turns model selection and prompt engineering from guesswork into data-driven decisions.

Cost and Scalability Considerations

Every evaluation run makes LLM calls for both the agent under test and the evaluator. With multiple agents and large datasets, costs add up quickly.

Practical strategies:

  • Use a cheaper model for evaluation judgments. If your agent runs on GPT-4o, use GPT-4o-mini or Claude Haiku for the LLM-as-Judge evaluator.
  • Run evaluations selectively. Only evaluate agents whose prompts or code changed in a given PR.
  • Cache agent responses. When iterating on evaluator logic, reuse cached outputs instead of re-running the agent.
  • Start small. Begin with 20–50 dataset rows covering critical edge cases. Expand as you identify gaps from production monitoring.
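The caching strategy can be as simple as keying stored outputs by a hash of the input. A minimal in-memory sketch (the `cached_agent_run` helper is illustrative, not a LangWatch feature; a real setup would persist the cache to disk between sessions):

```python
import hashlib

_cache: dict[str, str] = {}

def cached_agent_run(agent_input: str, run_agent) -> str:
    """Reuse a stored agent output for identical inputs so iterating on
    evaluator logic doesn't re-pay for LLM calls. `run_agent` is whatever
    callable invokes your agent."""
    key = hashlib.sha256(agent_input.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = run_agent(agent_input)
    return _cache[key]

calls = 0
def fake_agent(text: str) -> str:
    """Stand-in for the real agent; counts how often it is invoked."""
    global calls
    calls += 1
    return f"handled: {text}"

cached_agent_run("Book a flight to Tokyo", fake_agent)
cached_agent_run("Book a flight to Tokyo", fake_agent)  # served from cache
print(calls)  # 1
```

Note the trade-off: caching is only valid while the agent's prompt, model, and code are unchanged, so the cache key should include a version identifier in practice.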

When to Use What

The three testing layers serve different purposes. Choosing the right one depends on what you’re trying to learn:

                  Unit Tests                  Batch Evaluations                              Scenario Tests
Speed             Milliseconds                Minutes                                        Minutes
LLM calls         None                        Many                                           Few
Result type       Pass/fail                   Numeric score                                  Pass/fail
Best for          Deterministic logic         Quality tracking                               Realistic conversations
Run frequency     Every commit                Per PR / nightly                               Nightly / weekly
Example question  “Does input parsing work?”  “Across 50 inputs, how accurate is the agent?”  “Can the agent handle a multi-turn cancellation flow?”

The highest-value pattern: use production monitors to find failures, add those cases to your evaluation dataset, and track whether your fixes actually improve scores.

Getting Started

If you’re building AI agents and don’t have batch evaluations yet, here’s a practical starting path:

Phase 1: Foundation

  • Install langwatch and set your API key
  • Create a single experiment for your most critical agent
  • Start with 20 curated inputs covering known edge cases
  • Use one built-in evaluator plus one custom evaluator

Phase 2: Coverage

  • Expand datasets from production traces (real failures)
  • Add experiments for remaining agents
  • Integrate into CI — fail PRs if scores drop below threshold
  • Build custom evaluators for structured output fields

Phase 3: Production Loop

  • Enable LangWatch tracing in production
  • Set up monitors with alerting
  • Automatically populate datasets from low-scoring traces
  • Track quality trends across deployments

The key insight is that evaluation is not a one-time activity. It’s a continuous loop: measure, identify gaps, fix, re-measure. LangWatch’s combination of experiments, evaluators, datasets, and monitors provides the infrastructure for that loop.


Sources

  1. LangWatch Documentation — Evaluations Overview: Core concepts for experiments, evaluators, datasets, and monitors. langwatch.ai/docs/evaluations/overview

  2. LangWatch — Experiments via SDK: Python SDK reference for running batch experiments programmatically. langwatch.ai/docs/evaluations/experiments/sdk

  3. LangWatch — Evaluating Structured Data Extraction: Guide for evaluating agents that produce structured outputs like JSON and Pydantic models. docs.langwatch.ai/use-cases/structured-outputs

  4. LangWatch — Online Evaluation Overview: How monitors work for production quality tracking. docs.langwatch.ai/llm-evaluation/real-time-evaluation

  5. LangWatch — List of Evaluators: Complete catalog of built-in evaluators including RAGAS, LLM-as-Judge, and custom evaluators. langwatch.ai/docs/evaluations/evaluators/list

  6. Adaline — The Complete Guide to LLM & AI Agent Evaluation in 2026: Comprehensive overview of prompt-level, RAG, and agent evaluation strategies. adaline.ai/blog/complete-guide-llm-ai-agent-evaluation-2026

  7. Orchestrator.dev — AI Evaluation: Tools, Techniques, and Best Practices for 2026: Research showing systematic evaluation reduces production failures by up to 60%. orchestrator.dev/blog/2026-02-18-ai-evaluation-guide-2026

  8. Derek C. Ashmore — The AI Agent Testing Pyramid: A practical framework adapting the testing pyramid for non-deterministic AI systems. medium.com/@derekcashmore/the-ai-agent-testing-pyramid

  9. Kevin Tan — How to Test AI Agents Before They Break Production: Industry data showing only 52% of teams run offline evals and 37% evaluate in production. blog.jztan.com/testing-ai-agents-in-production

  10. Zylos Research — AI Agent Testing & Evaluation: The Complete 2026 Guide: The CLASSic framework (Cost, Latency, Accuracy, Stability, Security) for agent evaluation. zylos.ai/research/2026-01-12-ai-agent-testing-evaluation