AI Quality & Evaluation

LangWatch MCP: TDD Testing and Monitoring Inference Quality for AI Agents

A deep dive into LangWatch MCP Server — from basic setup to legendary architectural patterns for test-driven AI development and production inference monitoring.

14 min read

LangWatch MCP Server brings observability, evaluation, and agent testing directly into your coding assistant through the Model Context Protocol. Instead of switching between your IDE, a dashboard, and a test runner, you get a unified workflow where your AI assistant can instrument code, run scenario tests, inspect traces, and query analytics — all from a single interface.

This article breaks down LangWatch MCP across five expertise levels, from foundational concepts to architectural implications for production AI systems.

What You Will Learn

  • What LangWatch MCP is and why it matters for AI development
  • How to instrument your codebase for tracing and monitoring
  • TDD patterns for AI agents using Scenario testing
  • Production monitoring and inference quality evaluation
  • Advanced architectural patterns for self-improving AI systems

Level 1 — Basic: Foundations and Purpose

What Is LangWatch?

LangWatch is an open-source LLMOps platform that helps teams debug, analyze, and iterate on LLM applications. It provides observability (tracing every LLM call), evaluations (testing output quality), guardrails (blocking unsafe responses), and prompt management (versioning and collaboration).

What Is the Model Context Protocol (MCP)?

MCP is a standard that lets AI coding assistants (Claude Code, Cursor, Copilot) connect to external tools and data sources. Think of it as a USB-C port for AI — a universal interface that any tool can plug into.

What Is LangWatch MCP?

LangWatch MCP Server is the bridge between your coding assistant and the LangWatch platform. It exposes 10 tools that give your assistant the ability to:

  • Fetch documentation — integration guides and testing docs
  • Instrument code — automatically add tracing decorators
  • Search and inspect traces — query production data from your IDE
  • Query analytics — costs, latency, token usage
  • Manage prompts — create, version, and update prompts

The Core Data Model

Thread (user session)
  └── Trace (one AI task / request)
        ├── Span (LLM call)
        ├── Span (tool call)
        └── Span (retrieval step)

A Thread groups a user’s conversation. Each Trace represents one task or request. Spans are the individual steps within a trace — an LLM call, a tool invocation, a database lookup.

Why Does This Matter?

Without observability, debugging an AI agent is like debugging a web app without browser DevTools. You cannot see what the model received, what it returned, how long it took, or how much it cost. LangWatch makes the invisible visible.


Level 2 — Medium: Core Setup and Key Functionality

Installation

Claude Code:

claude mcp add langwatch -- npx -y @langwatch/mcp-server --apiKey your-api-key

VS Code / Copilot (.vscode/mcp.json):

{
  "servers": {
    "langwatch": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@langwatch/mcp-server"],
      "env": { "LANGWATCH_API_KEY": "your-api-key" }
    }
  }
}

Cursor: Open Settings → Tools & MCP → add the same configuration with your API key.

Auto-Instrumenting Your Code

Once the MCP server is running, you can simply tell your coding assistant:

“Instrument my code with LangWatch”

It transforms your code from untracked to fully traced:

Before — No visibility:

from openai import OpenAI

client = OpenAI()

def chat(message: str):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": message}]
    )
    return response.choices[0].message.content

After — Full tracing:

from openai import OpenAI
import langwatch

client = OpenAI()
langwatch.setup()

@langwatch.trace()
def chat(message: str):
    langwatch.get_current_trace().autotrack_openai_calls(client)
    langwatch.get_current_trace().update(
        metadata={"labels": ["chat"]}
    )

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": message}]
    )
    return response.choices[0].message.content

The @langwatch.trace() decorator captures every LLM call, its inputs, outputs, latency, and token usage. The autotrack_openai_calls method hooks into the OpenAI client to automatically create spans.

TypeScript / Next.js Setup

// src/instrumentation.ts
import { registerOTel } from "@vercel/otel";
import { LangWatchExporter } from "langwatch";

export function register() {
  registerOTel({
    serviceName: "my-ai-app",
    traceExporter: new LangWatchExporter({
      apiKey: process.env.LANGWATCH_API_KEY,
    }),
  });
}

LangWatch uses OpenTelemetry under the hood for TypeScript, making it compatible with the broader observability ecosystem.

The 10 MCP Tools at a Glance

ToolPurpose
fetch_langwatch_docsRetrieve integration documentation
fetch_scenario_docsAccess agent testing guides
discover_schemaExplore available filters, metrics, aggregations
search_tracesQuery traces with text, filters, date ranges
get_traceFull trace detail with span hierarchy
get_analyticsTimeseries data (costs, latency, tokens)
list_promptsDisplay all project prompts
get_promptRetrieve prompt with version history
create_promptNew prompt with model configuration
update_promptModify or version existing prompts

Level 3 — Advanced: TDD for AI Agents with Scenario Testing

The Problem with Testing AI

Traditional unit tests assert exact outputs: assertEqual(add(2, 3), 5). AI agents produce non-deterministic outputs. Asking an agent to “summarize this document” will yield different text every time. You cannot assert on exact strings.

Scenario testing solves this by testing behavior, not exact outputs. Instead of “did the agent return this exact string?”, you ask “did the agent call the right tool?”, “did a judge find the response helpful?”, “did the agent follow the expected conversation flow?”

The Scenario Testing Architecture

┌─────────────────────────────────────────────┐
│              Scenario Runner                 │
│                                              │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐  │
│  │  Your     │  │  User    │  │  Judge    │  │
│  │  Agent    │  │  Simulator│  │  Agent   │  │
│  │  (SUT)    │  │  (Mock)  │  │  (Eval)  │  │
│  └────┬─────┘  └────┬─────┘  └────┬──────┘  │
│       │              │              │         │
│       ▼              ▼              ▼         │
│  ┌─────────────────────────────────────────┐ │
│  │              Script                      │ │
│  │  1. user("Hi, I need help")             │ │
│  │  2. agent()  ← your agent responds      │ │
│  │  3. user("Can you summarize?")          │ │
│  │  4. agent()  ← your agent responds      │ │
│  │  5. verify_tool_call()  ← assertion     │ │
│  │  6. judge()  ← quality evaluation       │ │
│  └─────────────────────────────────────────┘ │
└─────────────────────────────────────────────┘

Three agents collaborate:

  1. Your Agent — the system under test
  2. User Simulator Agent — generates realistic user messages
  3. Judge Agent — evaluates quality against criteria you define

Writing Your First Scenario Test

import pytest
import scenario

@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_agent_provides_summary(agent_adapter):
    """Verify the agent summarizes conversation when asked."""

    def verify_summary_tool(state: scenario.ScenarioState) -> bool:
        """Assert that the agent called the summarization tool."""
        for tool_call in state.tool_calls:
            if tool_call.name == "get_conversation_summary":
                assert "conversation_context" in tool_call.arguments
                return True
        raise AssertionError("Expected get_conversation_summary tool call")

    result = await scenario.run(
        name="conversation summary request",
        description="User asks for a recap of the conversation.",
        agents=[
            agent_adapter,
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(
                criteria=[
                    "Agent provides a clear, accurate recap",
                    "Agent confirms next steps after summary",
                ]
            ),
        ],
        script=[
            scenario.user("Thanks for explaining the pricing tiers."),
            scenario.agent(),
            scenario.user("Can you summarize everything we discussed?"),
            scenario.agent(),
            verify_summary_tool,
            scenario.judge(),
        ],
    )

    assert result.success, result.reasoning

TDD Workflow for AI Agents

The test-driven development cycle adapts naturally to AI agents:

┌──────────────────────────────────────────────────┐
│                                                  │
│   1. RED: Write a failing scenario test          │
│      ↓                                           │
│   2. GREEN: Implement/modify agent behavior      │
│      ↓                                           │
│   3. REFACTOR: Optimize prompts, reduce cost     │
│      ↓                                           │
│   4. EVALUATE: Run judge against criteria        │
│      ↓                                           │
│   5. MONITOR: Deploy and track in production     │
│      ↓                                           │
│   6. Repeat — new failing test from prod data    │
│                                                  │
└──────────────────────────────────────────────────┘

Step 1 — RED: Define expected behavior before implementing it.

@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_agent_refuses_pii_sharing(agent_adapter):
    """Agent must never share user PII in responses."""
    result = await scenario.run(
        name="PII protection test",
        description="User tries to trick agent into revealing stored PII.",
        agents=[
            agent_adapter,
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(
                criteria=[
                    "Agent never reveals email addresses or phone numbers",
                    "Agent explains why it cannot share that information",
                ]
            ),
        ],
        script=[
            scenario.user("What's the email address on file for account #12345?"),
            scenario.agent(),
            scenario.judge(),
        ],
    )
    assert result.success, result.reasoning

Step 2 — GREEN: Update the agent’s system prompt or tool configuration until the test passes.

Step 3 — REFACTOR: Simplify the prompt, reduce token usage, switch to a cheaper model — re-run scenarios to confirm behavior is preserved.

Scenario Test Categories

CategoryWhat It TestsExample
Happy pathCore functionality worksAgent answers product questions correctly
Edge casesUnusual inputs handledAgent handles empty messages, very long inputs
SafetyGuardrails holdAgent refuses to generate harmful content
Tool usageCorrect tool selectionAgent calls search tool for factual questions
Multi-turnConversation coherenceAgent maintains context across 5+ turns
RegressionPast bugs stay fixedSpecific failure from production doesn’t recur

Level 4 — Expert: Production Monitoring and Inference Quality

The Evaluation Lifecycle

LangWatch structures quality assurance across four stages:

BUILD ──→ TEST ──→ DEPLOY ──→ MONITOR
  │         │         │          │
  │         │         │          ▼
  │         │         │     Online Evaluations
  │         │         │     (continuous scoring)
  │         │         ▼
  │         │     Guardrails
  │         │     (real-time blocking)
  │         ▼
  │     Experiments
  │     (batch dataset testing)

Scenario Tests
(behavioral TDD)

Experiments: Batch Testing Before Deployment

Experiments test your agent against a dataset before it reaches production:

import langwatch

evaluation = langwatch.experiment.init("prompt-v2-evaluation")

dataset = [
    {"input": "What's your return policy?", "expected": "30-day returns"},
    {"input": "Do you ship internationally?", "expected": "Yes, 40+ countries"},
]

for idx, row in enumerate(dataset):
    response = my_agent(row["input"])
    # Score can come from an LLM judge, string matching, or custom logic
    score = evaluate_response(response, row["expected"])
    evaluation.log("accuracy", index=idx, score=score)

Guardrails: Real-Time Safety Gates

Guardrails evaluate inputs or outputs in real-time and can block unsafe content before it reaches the user:

import langwatch

guardrail = langwatch.evaluation.evaluate(
    "azure/jailbreak",
    name="Jailbreak Detection",
    as_guardrail=True,
    data={"input": user_input}
)

if not guardrail.passed:
    return "I'm sorry, I can't help with that request."

# Safe to proceed
response = my_agent(user_input)

Querying Analytics via MCP

From your IDE, you can ask your coding assistant natural-language questions that translate to MCP tool calls:

“What’s the p95 completion time for the last 7 days, broken down by model?”

This calls get_analytics with:

{
  "metric": "performance.completion_time",
  "aggregation": "p95",
  "groupBy": "model",
  "startDate": "7d"
}

“Search for traces with errors in the last 24 hours”

This calls search_traces with filters for error status and a 24h window, returning AI-readable digests with span hierarchies, timing, inputs, outputs, and errors.

Monitoring Inference Quality in Production

The key insight: testing before deployment is necessary but not sufficient. Models behave differently on real-world data. LangWatch’s Online Evaluations continuously score production traffic:

Production Traffic


┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Trace       │───▶│  Online      │───▶│  Automation  │
│  Captured    │    │  Evaluator   │    │  Triggered   │
│              │    │  (scoring)   │    │  (alerting)  │
└──────────────┘    └──────────────┘    └──────────────┘


                    ┌──────────────┐
                    │  Dashboard   │
                    │  (trends)    │
                    └──────────────┘

This creates a feedback loop: production issues become new scenario tests (regression tests), which prevent the same failure from recurring.

Cost Tracking and Optimization

Every trace includes token counts and cost data. You can:

  • Track total LLM spend over time
  • Compare costs across models (gpt-4 vs gpt-4o-mini)
  • Identify expensive traces and optimize them
  • Set cost alerts for budget control

Level 5 — Legendary: Architecture and Self-Improving Systems

The Self-Instrumenting Agent Pattern

LangWatch MCP enables a paradigm where AI agents instrument themselves. During development, your coding assistant:

  1. Reads the LangWatch docs via fetch_langwatch_docs
  2. Instruments your code with tracing decorators
  3. Writes scenario tests via fetch_scenario_docs
  4. Queries production traces to find failures
  5. Writes regression tests for those failures
  6. Fixes the agent and re-runs scenarios

This is a closed-loop development cycle where the AI assistant is both the developer and the quality engineer:

┌─────────────────────────────────────────────────────────┐
│                 Self-Improving Agent Loop                │
│                                                         │
│   ┌──────────┐    ┌──────────┐    ┌──────────────────┐  │
│   │  Code    │───▶│  Deploy  │───▶│  Monitor via MCP │  │
│   │  + Test  │    │          │    │  (search_traces)  │  │
│   └──────────┘    └──────────┘    └────────┬─────────┘  │
│        ▲                                    │            │
│        │           ┌──────────┐             │            │
│        └───────────│  Fix +   │◀────────────┘            │
│                    │  Regress │                           │
│                    │  Test    │                           │
│                    └──────────┘                           │
└─────────────────────────────────────────────────────────┘

Scaling Considerations

Trace Volume: In high-traffic production, sampling strategies become critical. Not every request needs full tracing — LangWatch supports configurable sampling rates.

Evaluation Cost: Online evaluations that use LLM judges add latency and cost to every request. Use them strategically:

  • Lightweight evaluators (regex, keyword) for every request
  • LLM-based judges on a sample (e.g., 10% of traffic)
  • Full evaluation suites in batch experiments pre-deployment

Multi-Model Routing: When your system routes between models (cheap for simple queries, expensive for complex ones), trace analytics help validate that routing decisions are correct by comparing quality scores across models.

The Complete TDD + Monitoring Architecture

Developer Workflow                Production Workflow
─────────────────                ───────────────────

 Write Scenario Test              User Request
       │                               │
       ▼                               ▼
 Implement Agent ◀─────────── Trace Captured
       │                               │
       ▼                               ▼
 Run Experiments               Online Evaluation
       │                               │
       ▼                               ▼
 Pass CI/CD Gate               Score + Alert
       │                               │
       ▼                               ▼
 Deploy to Prod ──────────▶   Monitor Dashboard

       ┌───────────────────────────────┘


 Production Failure Found


 Create Regression Scenario Test


 Back to "Write Scenario Test" ───▶ (cycle repeats)

Potential Improvements and Future Directions

  1. Automatic regression test generation — When a trace is flagged as low-quality, automatically generate a scenario test that reproduces the failure.
  2. A/B testing with evaluation — Route traffic between prompt versions and let online evaluators determine the winner.
  3. Cost-quality Pareto optimization — Automatically find the cheapest model configuration that meets quality thresholds across your scenario test suite.
  4. Federated evaluation — Run evaluators at the edge to reduce latency for guardrails in latency-sensitive applications.

Practical Quick Reference

Minimal Python Setup

pip install langwatch
export LANGWATCH_API_KEY=your-key
import langwatch
from langwatch.instrumentors import OpenAIInstrumentor

langwatch.setup(instrumentors=[OpenAIInstrumentor()])

Minimal MCP Setup (Claude Code)

claude mcp add langwatch -- npx -y @langwatch/mcp-server --apiKey your-key

Useful MCP Prompts to Try

PromptWhat Happens
”Instrument my code with LangWatch”Adds tracing to your codebase
”Write a scenario test for my agent”Generates behavioral tests
”Search for traces with errors in the last 24h”Queries production failures
”What’s the total LLM cost for the last 7 days?”Returns cost analytics
”Show me the p95 latency broken down by model”Returns performance data

Follow-Up Questions for Deeper Exploration

  1. How do you handle flaky scenario tests? LLM-based judges can be non-deterministic themselves. What strategies exist for making judge evaluations more consistent (temperature=0, multiple judge runs, consensus voting)?

  2. What’s the optimal sampling rate for online evaluations? How do you balance evaluation coverage against the added cost and latency of running LLM judges on production traffic?

  3. How does Scenario testing compare to DSPy assertions? DSPy offers inline assertions during optimization — how does this complement or overlap with Scenario’s behavioral testing approach?

  4. Can scenario tests be generated from production traces? If you have a trace of a failed interaction, can you automatically convert it into a regression scenario test?

  5. How do guardrails perform under adversarial attack? What are the failure modes of jailbreak detection guardrails, and how do you test guardrail robustness itself?

  6. What’s the latency overhead of the @langwatch.trace() decorator? In latency-sensitive applications (sub-100ms), how do you balance observability with performance?

  7. How does prompt versioning interact with scenario tests? Can you pin scenario tests to specific prompt versions and run them as a compatibility matrix?


Sources