AI Quality & Evaluation

LLM-as-a-Judge for Agent Apps: Biases, Blind Spots, and Fixes

LLM-as-a-judge can be one of the most useful patterns in agent evaluation, but only if you understand where it breaks: order bias, self-preference, verbosity bias, weak judges, and evidence-free scoring. This guide explains the pattern, the common traps, and the fixes that make it practical.

17 min read

LLM-as-a-judge is one of those ideas that feels almost too convenient: use one model to evaluate another model, and suddenly your agent team has a scalable way to score quality without reviewing every run by hand.

That pattern is real, and it can be extremely useful. But it is also easy to misuse.

The research is fairly clear on the shape of the problem. Judge models can agree with humans surprisingly well on some tasks. They can also be biased by answer order, prefer outputs written in their own style, over-reward verbosity, drift when the rubric changes slightly, and confidently score something they cannot actually verify.

So the right question is not, “Should we use LLM-as-a-judge?”

It is:

Where does it fit in the testing stack, what should it judge, and what guardrails make the scores trustworthy enough to act on?

If you want the broader evaluation stack around this topic, the repo already has two useful companion pieces.

This article goes narrower and deeper on the judge pattern itself.

TL;DR

  • LLM-as-a-judge works best as one layer in agent evaluation, not as your only source of truth.
  • It is most reliable when you ask it to compare options, score against a strict rubric, or inspect a trace with evidence, not when you ask for vague open-ended opinions.
  • The main failure modes are position bias, self/family bias, familiarity bias, verbosity bias, prompt sensitivity, and weak-judge limits.
  • The highest-leverage fixes are order swapping, structured rubrics, deterministic checks for objective fields, trace-aware judging, human calibration, and selective escalation.
  • My engineering takeaway from the sources: if your judge cannot see the evidence behind the answer, it is often grading plausibility, not correctness.

What You Will Learn Here

  • What LLM-as-a-judge actually means in the context of agent apps
  • Where it fits relative to unit tests, scenario tests, batch evals, and trace grading
  • Which judging patterns are most useful in practice
  • What the research says about common judge biases and inconsistencies
  • How to harden the pattern so PMs and Engineers can use the scores without fooling themselves

Why Agent Apps Need More Than Deterministic Tests

Traditional software testing assumes that given the same input, the system should produce the same output. That still matters for agent apps, but it only covers part of the problem.

You can and should write deterministic tests for things like:

  • tool schemas
  • permission boundaries
  • routing logic
  • JSON validation
  • policy lookups
  • error handling

But those tests do not answer the questions that usually hurt agent teams in production:

  • Did the agent ask the right follow-up question?
  • Did it choose the right tool for the right reason?
  • Did it follow the policy while still sounding helpful?
  • Did the newer prompt improve the experience, or just make the answer longer?

As of March 30, 2026, OpenAI’s Evaluation best practices guide makes the same core point in plainer product language: generative systems are variable, so you need evals designed for that variability, and you need to maintain agreement with human judgment rather than relying on vibes.

That is the gap LLM-as-a-judge tries to fill.

Where LLM-as-a-Judge Fits in the Testing Stack

For agent apps, I think the cleanest mental model is not “judge everything.” It is “use the judge for the parts that are semantic and expensive to label manually.”

| Layer | What It Answers | Best Mechanism |
| --- | --- | --- |
| Unit tests | Did deterministic logic behave correctly? | Traditional tests |
| Scenario tests | Can the agent complete a realistic workflow end to end? | Multi-turn simulation + assertions + judge |
| Batch evals | Across many examples, is the agent getting better or worse? | Deterministic checks + rubric judging |
| Trace grading | Where in the workflow did the agent go wrong? | Trace-aware graders on real runs |

As of March 30, 2026, OpenAI’s Agent evals guide recommends trace grading for workflow-level errors. That is an important shift. It means the more mature conversation is no longer just “Did the final answer look good?” but also “What happened inside the run?”

That distinction matters because many agent failures are not purely textual:

  • wrong tool selected
  • correct tool selected with wrong parameters
  • tool result ignored
  • missing escalation
  • incomplete evidence

If you only judge the final answer, you often lose the most useful debugging information.

What LLM-as-a-Judge Is Good At

The 2023 paper Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena is still one of the most helpful baseline references here. It found that strong judges such as GPT-4 achieved over 80% agreement with human preferences on those benchmarks, roughly matching human-human agreement there.

That is the optimistic result, and it matters.

But the same paper also explicitly calls out limitations, including:

  • position bias
  • verbosity bias
  • self-enhancement bias
  • limited reasoning ability

So the fair summary is:

LLM judges are useful approximators of human preference on some evaluation tasks, not neutral oracles.

That is good enough for a lot of product work, as long as you design around the weaknesses.

Judge Patterns That Work Best

OpenAI’s evaluation guidance says models are generally better at discriminating between options than at producing open-ended judgments from scratch. That lines up well with what many teams discover empirically.

Here are the patterns I trust most, ordered from safer to riskier.

1. Pairwise Comparison

Ask the judge to compare two candidate outputs for the same task.

Example:

  • Prompt version A vs prompt version B
  • Old model vs new model
  • With retrieval guardrails vs without retrieval guardrails

This is often the best pattern for regressions because the judge is doing comparative work, not inventing a global standard from scratch.

Best use cases:

  • prompt experiments
  • model swaps
  • ranking alternatives

Main caveat:

  • answer order can bias the result, so always test both directions

2. Rubric-Based Scoring

Ask the judge to score a single output against a structured rubric with explicit dimensions.

Example dimensions:

  • factual grounding
  • policy adherence
  • clarity
  • completeness
  • appropriate escalation

This is usually the most practical pattern for production dashboards because it turns squishy quality discussions into a small set of repeatable dimensions.

Best use cases:

  • weekly quality reviews
  • release gates
  • scorecards for PM and engineering discussions

Main caveat:

  • vague rubrics create vague scores

3. Pointwise Pass/Fail

Ask the judge whether a single output passed a narrowly scoped criterion.

Example:

  • Did the agent ask for the missing booking ID before attempting cancellation?
  • Did the response stay within refund policy?

This can work well when the criterion is concrete and observable.

Best use cases:

  • compliance checks
  • narrow policy checks
  • single-goal assertions

Main caveat:

  • broad pass/fail questions usually hide ambiguity instead of reducing it
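
A narrow pass/fail criterion also benefits from a strict output contract. Here is a minimal sketch, assuming the judge is instructed to reply with exactly PASS or FAIL; anything else is treated as ambiguous rather than coerced into a score (the prompt text and helper name are illustrative, not a standard API):

```python
PASS_FAIL_PROMPT = """You are checking one criterion only:
Did the agent ask for the missing booking ID before attempting cancellation?
Reply with exactly one word: PASS or FAIL."""


def parse_pass_fail(raw: str):
    """Strictly parse a judge reply into True, False, or None.

    None means the judge broke the output contract; count it as
    "needs retry or human review", never as a silent pass or fail.
    """
    verdict = raw.strip().upper()
    if verdict == "PASS":
        return True
    if verdict == "FAIL":
        return False
    return None
```

Returning None instead of guessing keeps ambiguous judge replies out of your pass-rate metric.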

4. Trace-Aware Judging

Pass the judge the trace, not just the final response:

  • user input
  • selected tools
  • tool arguments
  • tool outputs
  • retrieved passages
  • final answer

This is the most important pattern for agent apps specifically.

A trace-aware judge can answer questions like:

  • Did the agent call the correct tool?
  • Did it ignore contradictory evidence?
  • Did it escalate when the policy required escalation?

That is much closer to how humans debug agents in real systems.

A Concrete Example for a Tool-Using Support Agent

Below is an OpenAI-flavored but framework-agnostic evaluation pattern. The main idea is to combine cheap deterministic checks with a stricter rubric judge rather than asking the judge to do everything.

from dataclasses import dataclass


@dataclass
class EvalCase:
    user_message: str
    expected_tool: str
    expected_escalation: bool
    expected_policy_id: str


def run_support_agent(user_message: str) -> dict:
    """
    Returns a trace-like structure:
    {
      "tool_calls": [...],
      "tool_results": [...],
      "final_answer": "...",
      "escalated": bool,
      "policy_id_used": "refund_policy_v3"
    }
    """
    ...


def grade_with_llm(model: str, rubric: str, payload: dict) -> dict:
    """
    Ask a judge model to return structured scores.
    In production, this can map to your eval platform or internal grader.
    """
    ...


case = EvalCase(
    user_message="I missed the deadline by two days. Can I still get a refund?",
    expected_tool="lookup_refund_policy",
    expected_escalation=False,
    expected_policy_id="refund_policy_v3",
)

trace = run_support_agent(case.user_message)

# Deterministic checks for objective facts
deterministic_scores = {
    "correct_tool": int(
        any(
            call.get("name") == case.expected_tool
            for call in (trace.get("tool_calls") or [])
        )
    ),
    "correct_policy": int(trace["policy_id_used"] == case.expected_policy_id),
    "correct_escalation_flag": int(trace["escalated"] == case.expected_escalation),
}

rubric = """
You are grading a customer-support agent trace.
Return a JSON object only, with no explanations or additional text.
The JSON must have exactly these keys, each mapped to an integer 0 or 1:
- policy_adherence
- user_clarity
- action_reasoning

Output format (example, values may change but structure must not):
{
  "policy_adherence": 1,
  "user_clarity": 0,
  "action_reasoning": 1
}

Rules:
- Use the tool outputs and policy_id_used as evidence.
- Do not reward extra verbosity.
- If the final answer sounds plausible but contradicts the policy evidence,
  give policy_adherence = 0.
"""

judge_scores = grade_with_llm(
    model="gpt-4.1-mini",
    rubric=rubric,
    payload={
        "user_message": case.user_message,
        "tool_calls": trace["tool_calls"],
        "tool_results": trace["tool_results"],
        "final_answer": trace["final_answer"],
        "policy_id_used": trace["policy_id_used"],
    },
)

final_scores = {
    **deterministic_scores,
    **judge_scores,
}

The pattern here is simple:

  • let code verify objective facts
  • let the judge score semantic quality
  • force the judge to look at evidence

That split removes a lot of unnecessary judge work and makes the scores easier to trust.

The Evaluation Flywheel

This is the workflow I would recommend to most teams building agent apps:

Production runs
      |
      v
Low-score traces / user complaints / sampled sessions
      |
      v
Curated eval dataset
      |
      +--------------------+
      |                    |
      v                    v
Deterministic checks   LLM judge rubrics
      |                    |
      +---------+----------+
                |
                v
        Compare runs over time
                |
                v
   Update prompts / tools / routing / policies
                |
                v
         Re-run evals before ship
                |
                v
            Back to production

The key point is that LLM-as-a-judge should sit inside a continuous loop, not as a one-off benchmark.

Common Issues with LLM-as-a-Judge

This is where most teams get burned.

1. Position Bias

The paper Large Language Models are not Fair Evaluators showed that rankings can be manipulated by changing the order in which candidate answers appear. In other words, the same two answers can produce different outcomes depending on which one is shown first.

That is not a minor academic edge case. It directly affects any pairwise eval setup.

Practical implication: if you compare prompt A and prompt B only once, your leaderboard may partly be an artifact of ordering.

2. Self-Preference Bias and Same-Family Bias

The paper Self-Preference Bias in LLM-as-a-Judge found that judges can prefer outputs that are more familiar to them, and the authors connect a large part of that effect to lower perplexity. The paper Replacing Judges with Juries makes a related point from a systems angle: single large judges can introduce intra-model bias, and a diverse panel of smaller judges can reduce that bias while cutting cost.

In practice, teams often describe this as a same-family bias problem:

  • one model family prefers its own style
  • one judge over-rewards its own preferred tone or formatting
  • a model that sounds “more like the judge” gets extra credit

Practical implication: if the same model family generates and judges, your evals may reward stylistic familiarity rather than user value.

3. Familiarity Bias

The paper Large Language Models are Inconsistent and Biased Evaluators found familiarity bias, skewed rating distributions, anchoring effects, and sensitivity to prompt changes that humans would not consider meaningful.

This matters because many product teams iterate on wording constantly. If your judge score moves a lot because the judge prompt changed slightly, you may think you improved the product when you really changed the measuring device.

Practical implication: a drifting judge can create fake regressions and fake improvements.

4. Verbosity and Style Bias

The MT-Bench paper explicitly calls out verbosity bias. This is the classic issue where a longer answer looks more impressive even when it is not more correct.

This is especially dangerous in PM reviews because “sounds better” is easily conflated with “is better.”

Practical implication: if your judge over-rewards long answers, teams may optimize toward impressive wording instead of task completion.

5. Evidence-Free Grading and Hallucinated Confidence

This one is partly documented and partly engineering inference from how trace grading changes the problem.

OpenAI’s Trace grading guide argues that black-box evaluation loses the information needed to understand why an agent succeeded or failed. I agree with that strongly.

My inference from the sources and from production experience is:

When a judge only sees the final answer, it often cannot verify factual correctness. It can only verify whether the answer looks plausible.

For agent apps, that creates a predictable failure mode:

  • the agent used the wrong tool
  • the answer still sounds polished
  • the judge gives it a decent score

That is not really hallucination in the generator alone. It is also a kind of hallucinated confidence in the evaluator.

6. Weak Judges vs Strong Agents

The paper On scalable oversight with weak LLMs judging strong LLMs explores what happens when weaker judges supervise stronger agents across tasks like math, coding, logic, and multimodal reasoning. The answer is nuanced, but the high-level lesson is simple: weaker judges are not free supervision. Their reliability depends on the task and the protocol.

The paper Trust or Escalate is useful here because it pushes toward a more operational answer: when confidence is low, escalate to humans or stronger review rather than pretending the score is equally trustworthy on every example.

Practical implication: the cheapest judge is not always the safest judge, especially on hard reasoning tasks.

Fixes That Make the Pattern Usable

The good news is that most of the biggest judge problems have boring, practical fixes.

1. Swap Order and Aggregate Both Directions

If you run pairwise comparisons, score both:

  • A vs B
  • B vs A

Then aggregate.

This is the minimum defense against position bias, and it is directly aligned with the mitigation ideas in Large Language Models are not Fair Evaluators.
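
As a sketch of that minimum defense, the helper below runs a hypothetical `judge(task, first, second)` callable in both presentation orders and only declares a winner when both orders agree; disagreement is reported as a tie rather than averaged away:

```python
def debiased_compare(judge, task, answer_a, answer_b):
    """Compare two answers with both presentation orders.

    `judge(task, first, second)` is a hypothetical callable returning
    "first", "second", or "tie". A winner is declared only when both
    orders agree; disagreement is treated as position-bias noise.
    """
    round1 = judge(task, answer_a, answer_b)  # A shown first
    round2 = judge(task, answer_b, answer_a)  # B shown first

    if round1 == "first" and round2 == "second":
        return "A"
    if round1 == "second" and round2 == "first":
        return "B"
    return "tie"
```

A judge that always prefers whichever answer appears first will produce a tie here instead of a fake winner.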

2. Use Structured Rubrics, Not Vague Prompts

Instead of:

Is this response good?

Use:

  • policy_adherence
  • grounded_in_retrieved_context
  • requested_missing_info_before_action
  • escalated_when_required

This does not remove bias, but it makes the bias easier to detect and discuss.
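
One lightweight way to keep rubrics structured is to define the dimensions as data and generate the judge prompt from them, so the dimension list and the expected JSON keys cannot drift apart. A minimal sketch (dimension names and wording are illustrative):

```python
RUBRIC = {
    "policy_adherence": "The answer is consistent with the cited policy evidence.",
    "grounded_in_retrieved_context": "Claims are supported by the retrieved passages.",
    "requested_missing_info_before_action": "The agent asked for required fields before acting.",
    "escalated_when_required": "The agent escalated when the policy demanded it.",
}


def build_rubric_prompt(rubric: dict) -> str:
    """Render a rubric dict into judge instructions with a fixed key set."""
    lines = [
        "Score each dimension as an integer 0 or 1.",
        "Return a JSON object with exactly these keys and no other text:",
    ]
    lines += [f"- {name}: {definition}" for name, definition in rubric.items()]
    return "\n".join(lines)
```

The same dict can then validate the judge's JSON reply: any missing or extra key is a contract violation, not a score.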

3. Keep Objective Checks Deterministic

Do not use an LLM judge for facts you can compute directly.

Good deterministic checks:

  • exact tool selected
  • required tool argument present
  • output parses against schema
  • policy version matches expectation
  • escalation flag matches expectation

Reserve the judge for questions humans would also evaluate semantically.
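
For example, tool selection and required arguments can be verified in a few lines of plain code against a trace shaped like the one in the earlier example (the field names `tool_calls`, `name`, and `arguments` are assumptions about your trace format):

```python
def check_tool_call(trace: dict, expected_tool: str, required_args: set) -> dict:
    """Deterministically verify tool choice and argument presence."""
    calls = trace.get("tool_calls") or []
    # First call matching the expected tool name, if any
    match = next((c for c in calls if c.get("name") == expected_tool), None)
    return {
        "tool_selected": match is not None,
        "required_args_present": (
            match is not None
            and required_args <= set(match.get("arguments", {}))
        ),
    }
```

These checks cost nothing per run and never drift, which is exactly why an LLM judge should not be asked to do them.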

4. Pass Evidence or Traces, Not Only Final Text

If you want to judge:

  • factual grounding
  • correct tool use
  • correct escalation
  • policy adherence

then the judge should see the evidence trail.

That means passing:

  • retrieved passages
  • tool calls
  • tool outputs
  • policy IDs
  • intermediate reasoning artifacts where appropriate

This is exactly why trace-aware evaluation is more informative than pure output grading.

5. Calibrate Against Humans

OpenAI’s Evaluation best practices explicitly recommends maintaining agreement with human feedback. That should not be treated as optional.

A healthy workflow looks like this:

  • sample a slice of eval cases
  • have humans label them
  • compare judge labels against humans
  • revise the rubric or judge model if agreement is poor

Without that loop, it is easy to optimize against a judge that your users would disagree with.
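
Agreement can start as a raw match rate, but chance-corrected agreement (Cohen's kappa) is worth computing too, because a judge that passes almost everything can show high raw agreement for free. A small self-contained sketch:

```python
from collections import Counter


def agreement_and_kappa(human_labels, judge_labels):
    """Raw agreement rate plus Cohen's kappa (chance-corrected agreement)."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and equal length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    # Chance agreement from each rater's marginal label frequencies
    h_counts = Counter(human_labels)
    j_counts = Counter(judge_labels)
    expected = sum(
        (h_counts[label] / n) * (j_counts[label] / n)
        for label in set(human_labels) | set(judge_labels)
    )
    if expected == 1.0:  # both raters use a single label everywhere
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)
```

If kappa is low while raw agreement looks fine, the judge is probably riding the base rate rather than tracking human judgment.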

6. Use Cross-Family Model Juries When Bias Matters

The Replacing Judges with Juries paper is especially practical for teams that worry about model-family bias. Their result suggests that a panel of smaller, diverse judges can outperform a single large judge and reduce intra-model bias while also costing less.

That does not mean every team needs a model jury on day one.

It does mean that if your evals are making expensive decisions:

  • model migrations
  • release gates
  • compliance-sensitive scoring
  • major prompt policy changes

then a cross-family panel is often a smarter design than a single favored judge.
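
A minimal jury aggregator needs only a strict-majority vote plus a signal for disagreement, so that split verdicts can feed the escalation path rather than being averaged away (a sketch, not tied to any framework):

```python
from collections import Counter


def jury_verdict(votes: list):
    """Strict-majority vote across judges from different model families.

    Returns (verdict, unanimous). A split with no strict majority returns
    (None, False), which callers should treat as "escalate", not "tie".
    """
    counts = Counter(votes)
    verdict, top = counts.most_common(1)[0]
    if top * 2 <= len(votes):
        return None, False
    return verdict, top == len(votes)
```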

7. Escalate Uncertain Cases

Not every case deserves the same automation level.

If a judge is uncertain, if two judges disagree, or if the case touches a high-risk policy surface, escalate it. That is the operational spirit behind Trust or Escalate: do not force full automation where confidence is weak.

In real systems, that can be as simple as:

  • send low-confidence evals to human review
  • require agreement across two judges for release-critical cases
  • route policy-sensitive regressions to a domain expert
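
That routing can be encoded directly. The tier names and confidence threshold below are illustrative assumptions, not recommendations:

```python
def route_eval_result(judge_confidence: float,
                      judges_agree: bool,
                      policy_sensitive: bool,
                      confidence_floor: float = 0.8) -> str:
    """Decide how much human attention an eval result needs.

    Order matters: policy-sensitive cases go to a domain expert even
    when the judges are confident and unanimous.
    """
    if policy_sensitive:
        return "domain_expert"
    if not judges_agree or judge_confidence < confidence_floor:
        return "human_review"
    return "auto_accept"
```

The useful property is that the escalation policy becomes reviewable and versionable, just like the judge prompts themselves.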

A Practical Rollout Checklist

If your team wants to adopt LLM-as-a-judge without creating a measurement trap, I would start here:

  1. Keep unit tests for deterministic logic.
  2. Use scenario tests for realistic multi-turn workflows.
  3. Add judge-based evals for semantic quality, not for objective facts.
  4. Start with one narrow rubric on one critical workflow.
  5. For pairwise evals, always score both answer orders.
  6. Pass traces or evidence whenever correctness depends on tools or retrieval.
  7. Calibrate judge scores against human labels before using them as release gates.
  8. Use a cheaper judge first, but escalate hard cases to stronger judges or humans.
  9. If same-family bias matters, test a small cross-family jury.
  10. Treat judge prompts like code: version them, review them, and revalidate them after changes.

The Real Takeaway

LLM-as-a-judge is not fake, and it is not enough on its own.

The research supports a balanced view:

  • strong judges can be genuinely useful
  • some judging tasks are much more reliable than others
  • the failure modes are well known enough now that teams should plan for them up front

My practical recommendation for Engineers and PMs is:

Use judges to scale evaluation, not to replace judgment.

That means:

  • keep deterministic checks where possible
  • make judges compare or score against explicit criteria
  • give them evidence
  • measure their agreement with humans
  • escalate uncertain cases instead of hiding them in an average score

If you do that, LLM-as-a-judge becomes a very good teammate.

If you do not, it becomes a very convincing spreadsheet generator.

Sources

  1. OpenAI API Docs — Evaluation Best Practices — Product guidance on task-specific eval design, human calibration, and why pairwise comparison and structured scoring are often more reliable than vague open-ended judging. Accessed March 30, 2026.
  2. OpenAI API Docs — Agent Evals — Current platform guidance on reproducible agent evaluation and when to use workflow-level evaluation tools. Accessed March 30, 2026.
  3. OpenAI API Docs — Trace Grading — Explains why grading full traces gives more diagnostic value than black-box output checks for agent systems. Accessed March 30, 2026.
  4. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Found that strong judges can align well with human preferences on some benchmarks, while also documenting position bias, verbosity bias, self-enhancement bias, and reasoning limits.
  5. Large Language Models are not Fair Evaluators — Shows order effects in pairwise evaluation and proposes mitigation strategies including balanced position calibration and human-in-the-loop review.
  6. Large Language Models are Inconsistent and Biased Evaluators — Documents familiarity bias, skewed rating distributions, anchoring effects, and prompt sensitivity in LLM evaluators.
  7. Self-Preference Bias in LLM-as-a-Judge — Analyzes how judges can prefer outputs that feel more familiar to them and connects that bias to perplexity.
  8. Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models — Argues for diverse judge panels as a way to reduce intra-model bias and cost.
  9. On Scalable Oversight with Weak LLMs Judging Strong LLMs — Explores the limits of weaker judges supervising stronger systems across multiple task types.
  10. Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement — Useful reference for selective escalation and confidence-aware evaluation design.