AI Quality & Evaluation

How to Build a Good Agentic Code Reviewer

Most AI code review bots fail for a simple reason: they optimize for visible comments instead of reviewer trust. This guide pulls together current benchmarks, practitioner reports, product limitations, and design patterns for building a code review agent that is fast, quiet, and less prone to bias and hallucination.

23 min read

Every team wants the same thing from AI code review: catch real problems early, give useful feedback fast, and reduce load on human reviewers.

What many teams get instead is something noisier:

  • comments that sound plausible but are wrong
  • repeated feedback on already-addressed points
  • style nits mixed with real defects
  • too much confidence on low-context diffs
  • pressure to “say something” on every pull request

That last point matters more than it looks.

My current read, after reviewing the latest papers, product docs, and field reports as of April 15, 2026, is that the best code review agent is not the one that comments the most. It is the one that knows when to stay silent, grounds every claim in evidence, and optimizes for reviewer trust over visible activity.

The research is directionally clear:

  • frontier models are still far below expert human review on real PR benchmarks
  • more context does not automatically improve review quality
  • concise, targeted comments are more likely to change code
  • hallucination filtering and critic-style pipelines help
  • teams do get value from AI review, but only when noise stays low

So the real design question is not:

“How do I make a bot that always comments?”

It is:

“How do I build an agentic reviewer that is selective, grounded, fast enough to matter, and useful enough that engineers keep trusting it?”

TL;DR

  • As of April 15, 2026, the newest public PR-review benchmarks still show frontier models catching only a minority of human-flagged issues on realistic pull requests.
  • The strongest review systems are not pure “one prompt over the whole diff” systems. They are multi-stage systems with triage, targeted context retrieval, candidate critique generation, grounding checks, and a strong no-comment path.
  • Good models for code review are not just good coding models. They need strong multi-file comprehension, low false-positive behavior, concise explanation, and a willingness to abstain.
  • Bad code and bad PR shape make both humans and LLMs worse at reviewing. Large, mixed-scope, over-abstracted diffs create attention dilution and low-signal comments.
  • Existing bots often feel bad because they are biased toward easy, local, line-level comments and weak at contextual issues spread across files, tests, and architecture.
  • A good review agent should optimize for precision, implementation rate, and trust, not raw comment count.
  • If I had to pick one principle to anchor the whole system, it would be: no evidence, no comment.

What Code Review Is Actually For

Before deciding how an agent should review code, it helps to restate the job.

Good code review is not only about finding bugs. It usually does at least five things:

  1. catches correctness or security problems before merge
  2. protects code health and maintainability
  3. checks tests, edge cases, and failure handling
  4. transfers knowledge between author and reviewer
  5. creates a written record of why the change is safe enough to merge

That is why weak bots feel frustrating even when some of their comments are technically correct. They often participate in the conversation without actually helping the team do the job above.

The older large-scale review literature already showed that review coverage is associated with better software quality and fewer security issues. More recent work on understandability shows something else important: a huge portion of real review effort is about making code easier to read and maintain, not just catching runtime defects.

So a good agentic reviewer should not behave like a static analyzer wearing an LLM costume. It should behave more like a careful junior-to-mid reviewer with strong tooling support:

  • it should surface risky issues
  • distinguish important comments from optional ones
  • explain why the point matters
  • avoid pretending certainty it does not have

What the Latest Evidence Says

This is the part many product demos skip.

Benchmarks: frontier models still miss a lot

The most important recent benchmark I found is SWE-PRBench from March 27, 2026. It evaluates code review quality on 350 pull requests with human-annotated ground truth. The headline result is sobering: the eight frontier models evaluated detect only 15% to 31% of human-flagged issues in the diff-only setup. The paper also reports that all eight models got worse when given more context in richer configurations, which the authors attribute to attention dilution on contextual issues.

That means two things:

  • code review is a different task from code generation
  • “just give the model more repo context” is not a reliable design rule

Another useful benchmark is CodeFuse-CR-Bench from late 2025. It is more repository-level and comprehensiveness-aware. Its authors conclude that no single LLM dominates all aspects of code review, though Gemini 2.5 Pro achieved the highest comprehensive score on their setup.

My inference from those two benchmark families is:

model choice matters, but review-system design matters even more.

If the top tier is tightly clustered and performance still trails expert humans by a wide margin, the way you scope context, filter candidates, and gate comments becomes the real differentiator.

Field studies: usefulness is real, but so is noise

The Automated Code Review In Practice industry study is one of the best reality checks here. It examined 4,335 pull requests across three projects in an industrial environment using an LLM-based review tool. The paper reports that 73.8% of automated comments were resolved, which sounds promising at first glance. But it also found the average pull request closure duration increased from 5 hours 52 minutes to 8 hours 20 minutes, and practitioners mostly reported only a minor improvement in code quality.

That is a very familiar tradeoff:

  • yes, the tool found useful things
  • no, that does not automatically mean the full review workflow got better

Another large-scale study, Does AI Code Review Lead to Code Changes?, analyzed more than 22,000 review comments from 16 AI review GitHub Actions across 178 repositories. Its conclusion is one of the strongest design signals in this whole area: comments were more likely to lead to code changes when they were:

  • concise
  • accompanied by code snippets
  • manually triggered
  • produced by hunk-level review tools

That aligns almost perfectly with what engineers say informally: small, specific, well-scoped comments get acted on. Generic review theater does not.

Best-case product stories: fast feedback can be valuable

There are also positive reports, but they are usually from tuned systems and should be read as case studies, not universal laws.

For example, Graphite says its Claude-powered reviewer delivered a 40x faster feedback loop, moving from about 1 hour to 90 seconds, with a 96% positive feedback rate on AI-generated comments and a 67% implementation rate of suggested changes. Because that is vendor-reported, I treat it as evidence that strong results are possible, not as neutral proof that every team should expect those numbers.

Still, it matters for product design:

  • speed matters
  • comment quality matters more
  • teams will tolerate automation when it reliably helps before a human arrives

Why Existing Code Review Bots Often Feel Bad

The user complaint I hear most often is not, “the bot never helps.” It is, “the bot helps sometimes, but I don’t trust it enough.”

That trust problem usually comes from the same set of failures.

1. They act like they must always comment

Your impression is directionally right.

I do not have evidence that every vendor explicitly optimizes for “always leave a comment,” but product behavior often pushes in that direction. GitHub’s Copilot code review docs explicitly say Copilot always leaves a “Comment” review (it never approves or requests changes), and also warn that on re-review it may repeat the same comments even after they were resolved or downvoted.

That is not the same as “always finds an issue,” but it reinforces the visible-participation pattern:

  • the bot appears in the review
  • the bot says something
  • the UI shows activity

My inference from the literature and product docs is that many tools are still implicitly optimized for review presence, not review precision.

That is backwards.

For trust, the most valuable output is often:

“I inspected this change and I do not currently see a high-confidence issue worth interrupting the author for.”

2. They are biased toward easy local comments

SWE-PRBench is especially useful here because it shows models degrading when more context is added. The lost performance is concentrated in contextual issue detection, not simple line-level spotting.

That means current bots are biased toward:

  • obvious local smells
  • syntax-adjacent issues
  • narrow, line-scoped observations

And weaker at:

  • cross-file invariants
  • missing tests for behavior spread across modules
  • architecture mismatches
  • subtle semantic regressions

This is one reason bots can feel nitpicky while still missing the problem the human reviewer actually cared about.

3. They hallucinate plausible concerns

The HalluJudge paper exists because this problem is serious enough to justify a dedicated safeguard layer. The paper studies hallucinated review comments in real enterprise projects and reports a practical hallucination-assessment approach that reaches an F1 of 0.85 at an average cost of $0.009. It also found that about 67% of HalluJudge assessments aligned with developer preferences in online production.

That is important because it reframes the problem:

  • review generation is not enough
  • review grounding has to be checked too

4. They confuse style guidance with merge-blocking risk

Human reviewers already struggle with this. Google’s review guidance explicitly recommends labeling severity so developers know what is required versus optional. Many bots still collapse everything into one undifferentiated stream of comments.

That creates two bad outcomes:

  • authors overreact to nits
  • serious issues get buried in low-priority noise

5. They forget the reviewer is a human with limited attention

This is the part I think many product designs miss.

A review comment is not free. It costs:

  • attention
  • context switching
  • defensive interpretation
  • implementation time
  • follow-up discussion

So even a technically reasonable comment can be net negative if it is low-priority, weakly grounded, or badly timed.

What Makes a Good LLM for Code Review

When people ask for the “best model” for code review, I think the wrong instinct is to rank models from one benchmark and stop there.

As of April 15, 2026, the better question is:

What model behaviors matter most for review quality in my workflow?

Here is the checklist I would use.

  • Multi-file comprehension: real PR issues are often spread across implementation, tests, and call sites.
  • Low false-positive rate: a noisy reviewer gets ignored even if some comments are useful.
  • Concise explanation: long generic comments are less likely to be acted on.
  • Evidence grounding: every claim should map back to a diff hunk, nearby code, or retrieved context.
  • Good abstention behavior: the model must be willing to say “no high-confidence issue found.”
  • Stable instruction following: repo-specific review instructions must actually change the output.
  • Strong critic behavior: critiquing is a different skill from generating code.

OpenAI’s CriticGPT work is relevant here even though it is not a PR-review product. It shows that critique-specialized models can help humans catch issues they would otherwise miss, and that critique systems can produce fewer nitpicks and fewer hallucinated problems than a more generic model acting alone.

That suggests an important design principle:

for review, prefer critic-style behavior over author-style behavior.

A great code generator is not automatically a great code reviewer.

So which LLMs look strongest right now?

Based on the latest benchmark evidence I reviewed:

  • frontier reasoning/coding models are the right starting pool
  • Gemini 2.5 Pro looked strongest on the comprehensive CodeFuse-CR-Bench setup
  • the top four models in SWE-PRBench were statistically indistinguishable

So my recommendation is not “pick model X and you are done.”

It is:

  1. start with one or two frontier models that are strong on code reasoning
  2. evaluate them on your own repository history
  3. optimize for precision and implementation rate, not benchmark vanity

If two models are close, prefer the one that gives you:

  • fewer hallucinated comments
  • better abstention
  • better hunk-level grounding
  • lower latency at the review depth you actually need

How Good Code vs Bad Code Changes the Review Itself

This topic deserves more attention than it usually gets.

A code review agent is not reviewing in a vacuum. It is reviewing a specific artifact with a specific shape:

  • the code
  • the tests
  • the diff size
  • the PR description
  • the surrounding repo context

That shape changes review quality for both humans and models.

Good code makes review easier

The paper Understanding Code Understandability Improvements in Code Reviews found that developers spend 58% to 70% of their time reading code, and that over 42% of analyzed review comments focused on improving understandability. It also found 83.9% of understandability-improvement suggestions were accepted.

That tells us something simple but powerful:

readability is not cosmetic review work. It is a large fraction of review work.

Good reviewable code usually has:

  • small, coherent diffs
  • meaningful names
  • tests close to the behavioral change
  • limited abstraction depth
  • clear failure handling
  • a PR description that explains intent and risk

These traits help both humans and bots form a correct mental model faster.

Bad code shifts review into guesswork

Bad reviewability often comes from:

  • huge mixed-scope PRs
  • helper functions introduced for one use only
  • duplicated logic spread across files
  • over-abstraction that hides the real behavioral change
  • weak or missing tests
  • PR descriptions that say almost nothing

The recent Reddit discussion about AI-generated PRs being harder to review describes this failure mode well: the code can look polished line by line while the change as a whole feels bloated, over-abstracted, and hard to reason about. That is anecdotal, but it lines up with what many reviewers now see in AI-heavy workflows.

The key design implication is:

a good review agent should sometimes comment on reviewability itself.

Not as a vague process complaint, but as a merge-risk observation such as:

  • “This PR mixes schema, API, and UI behavior; consider splitting because the test impact is difficult to verify in one pass.”
  • “The behavioral change is clear, but there is no test delta covering the new failure path.”
  • “Three new helper layers were added, but each is only used once; this may increase maintenance cost without reducing complexity.”

That is much more useful than another style nit.

Common Failure Modes I Would Design Against

If I were reviewing the requirements for an AI code review bot today, these are the failures I would explicitly guard against.

1. Comment-count optimization

If the system is rewarded for producing visible comments, it will over-comment.

2. Context explosion

If the system is rewarded for “using all available context,” it may dilute attention and become worse at finding the actual issue.

3. Shallow correctness theater

If the system can sound precise without grounding each claim, it will hallucinate authority.

4. Severity collapse

If the system cannot distinguish must-fix from optional, every comment becomes emotionally expensive.

5. Re-review amnesia

If the system does not remember resolved feedback or reason over the delta since the last pass, it will repeat itself.

6. Single-pass architecture

If one prompt handles triage, retrieval, critique, prioritization, and phrasing, each subtask will be weaker than it needs to be.

7. No abstention path

If the system cannot exit cleanly with “no high-confidence issue,” it will fill the silence with low-value guesses.

8. No learning loop

If dismissals, accepts, edits, and developer feedback do not feed back into evaluation, the system will plateau.

A Better Agentic Design

This is the architecture I would trust more than a plain “review the diff” prompt.

Stage 1: Diff triage

Use a fast model or deterministic heuristics to classify the PR:

  • file types touched
  • estimated risk
  • test delta present or absent
  • security-sensitive paths
  • likely need for deeper contextual review

This stage should decide whether the PR needs:

  • no AI review
  • a fast hunk-level review
  • a deeper targeted review
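
As a concrete sketch, triage can be almost entirely deterministic. The path prefixes, the 400-line threshold, and the three-way depth decision below are illustrative assumptions of mine, not values from any of the cited studies:

```python
from dataclasses import dataclass, field

@dataclass
class TriageResult:
    depth: str                          # "skip" | "fast" | "deep"
    reasons: list = field(default_factory=list)

# Hypothetical markers; a real system would read these from repo config.
SECURITY_PATHS = ("auth/", "crypto/", "payments/")
DOC_SUFFIXES = (".md", ".rst", ".txt")

def triage(changed_files, lines_changed, has_test_delta):
    """Classify a PR with cheap deterministic signals before any LLM call."""
    if all(f.endswith(DOC_SUFFIXES) for f in changed_files):
        return TriageResult("skip", ["docs-only change"])

    reasons = []
    if any(f.startswith(SECURITY_PATHS) for f in changed_files):
        reasons.append("security-sensitive path touched")
    if not has_test_delta:
        reasons.append("no test delta")

    # Large or security-sensitive diffs earn the deeper, slower review path.
    if lines_changed > 400 or "security-sensitive path touched" in reasons:
        return TriageResult("deep", reasons or ["large diff"])
    return TriageResult("fast", reasons)
```

The point of the sketch is that no model call is needed to decide whether a model call is worth making.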

Stage 2: Targeted context retrieval

Do not dump the whole repository into the prompt.

Retrieve only the context most likely to matter:

  • touched functions and callers
  • nearby tests
  • relevant interfaces and types
  • repo review instructions
  • ownership or architectural guidance for the changed area

SWE-PRBench is the clearest warning sign against naïve context stuffing.
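
A minimal version of targeted retrieval, assuming you already maintain a symbol index (the `call_graph` and `test_index` mappings here are hypothetical stand-ins for that index), might look like:

```python
def retrieve_context(touched_symbols, call_graph, test_index, budget=5):
    """Collect only the artifacts most likely to matter for review:
    direct callers of each touched symbol and the tests that exercise it,
    capped by a hard budget instead of dumping the whole repository.
    """
    context, seen = [], set()
    for sym in touched_symbols:
        candidates = call_graph.get(sym, []) + test_index.get(sym, [])
        for item in candidates:
            if item not in seen and len(context) < budget:
                seen.add(item)
                context.append(item)
    return context
```

The hard budget is the design choice that matters: it encodes the benchmark finding that more context can make review worse, not better.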

Stage 3: Candidate critique generation

Generate possible comments, but require each candidate to include structured fields such as:

  • claim
  • evidence location
  • why it matters
  • severity
  • confidence
  • suggested next action

This makes later filtering much easier.
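
One way to enforce that structure is a plain dataclass. The field names follow the list above; the validation rule is my assumption about what "well-formed" should mean:

```python
from dataclasses import dataclass

@dataclass
class CandidateComment:
    claim: str          # what the reviewer asserts is wrong or risky
    evidence: str       # file/hunk or retrieved artifact backing the claim
    why: str            # one-sentence impact explanation
    severity: str       # "must-fix" | "should-fix" | "consider"
    confidence: float   # 0.0-1.0, as reported by the model
    action: str         # concrete next step or question for the author

    def is_well_formed(self):
        # A candidate with no claim or no evidence is unfilterable noise.
        return bool(self.claim and self.evidence) and 0.0 <= self.confidence <= 1.0
```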

Stage 4: Grounding and hallucination filter

Before any comment reaches the human, run a verifier layer.

This can be:

  • a second model
  • a cheaper critic
  • a HalluJudge-style grounding check
  • deterministic evidence checks where possible

A simple but high-value rule is:

reject comments that cannot point to concrete evidence in the diff or retrieved context.
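
The deterministic half of that rule is tiny to implement. Candidates here are plain dicts with claim and evidence fields; a verifier model would sit on top of this check, not replace it:

```python
def ground(candidates, diff_hunks, retrieved_context):
    """Apply "no evidence, no comment": keep only candidates whose evidence
    field points at something that actually exists in the diff or in the
    retrieved context. Free-floating claims are rejected outright.
    """
    known_evidence = set(diff_hunks) | set(retrieved_context)
    return [c for c in candidates if c.get("evidence") in known_evidence]
```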

Stage 5: Deduplication and prioritization

Merge overlapping comments and rank the survivors.

The final output should usually contain only a small number of comments, with clear labels such as:

  • Must fix
  • Should fix
  • Consider

That mirrors good human review practice better than a flat stream.

Stage 6: No-comment gate

This is the most important stage.

If no comment passes the evidence, confidence, and priority thresholds, the system should return:

No high-confidence issues found in this pass.

That is a feature, not a failure.
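
Stages 5 and 6 compose naturally into one small function. The confidence threshold and comment cap below are illustrative defaults, not empirically derived values; candidates are plain dicts with claim, evidence, severity, and confidence fields:

```python
SEVERITY_RANK = {"must-fix": 0, "should-fix": 1, "consider": 2}

def finalize(candidates, min_confidence=0.7, max_comments=5):
    """Deduplicate by (evidence, claim), drop low-confidence survivors,
    rank by severity then confidence, and cap the output. Returning an
    empty list is the no-comment gate firing, which is a valid success.
    """
    seen, unique = set(), []
    for c in candidates:
        key = (c["evidence"], c["claim"])
        if key not in seen:
            seen.add(key)
            unique.append(c)

    survivors = [c for c in unique if c["confidence"] >= min_confidence]
    survivors.sort(key=lambda c: (SEVERITY_RANK[c["severity"]], -c["confidence"]))
    return survivors[:max_comments]  # [] means "no high-confidence issues found"
```

Notice that the empty list needs no special handling downstream: the UI simply renders the no-issues message instead of comments.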

A Practical Comment Schema

If you want fewer biased or ambiguous comments, force the system into a stricter shape.

For example:

  • category: correctness, security, tests, maintainability, or understandability
  • severity: must-fix, should-fix, or consider
  • confidence: low, medium, or high
  • evidence: exact file, hunk, or retrieved artifact
  • why: one short explanation of impact
  • action: a concrete next step or question

That schema helps in four ways:

  • it reduces vague style commentary
  • it makes the bot’s confidence inspectable
  • it lets you measure which kinds of comments get accepted
  • it makes the UI easier for humans to scan

Google’s reviewer guidance also supports the idea of separating required changes from optional suggestions. In practice, I think many AI tools would feel dramatically better if they just did this one thing well.

How Fast Should an Agentic Reviewer Be?

There is no universal standard here, so this section is partly inference from product behavior and reviewer ergonomics rather than a formal rule from the literature.

My recommendation:

  • for local or uncommitted-change review, aim for under 30 seconds
  • for first-pass PR review, aim for 60 to 120 seconds
  • for deep review on risky PRs, make it manual or explicitly triggered

Why these numbers?

  • GitHub documents that Copilot’s uncommitted-change review in VS Code usually takes less than 30 seconds
  • Graphite’s case study suggests that moving first-pass review into the seconds-to-90-seconds range changes team behavior meaningfully
  • the industrial study shows that if automation stretches overall PR cycle time without enough value, teams feel the drag

So the speed goal is not “as fast as possible.” It is:

fast enough to arrive before the human reviewer, but not so aggressive that quality collapses into noise.

How to Make Comments Less Biased and More Correct

When you say “less bias” in this domain, I think there are three practical meanings:

  1. less bias toward easy style comments over hard semantic issues
  2. less bias toward commenting even when uncertain
  3. less bias toward one team’s habits being treated as universal truth

Here is how I would reduce each one.

Bias toward easy comments

Evaluate separately for:

  • local line issues
  • contextual cross-file issues
  • tests and missing coverage
  • maintainability and understandability

If you only measure aggregate acceptance, the system will learn to farm the easy bucket.

Bias toward speaking when unsure

Reward abstention.

If a low-confidence no-comment outcome is scored as failure, the model will hallucinate comments to avoid silence.

Bias toward one coding style

Constrain style guidance through explicit repo instructions and keep it small enough to be reliably read. GitHub’s Copilot docs note that only the first 4,000 characters of custom instruction files affect code review. That is a useful practical reminder: long hidden prompt novels are not a robust governance strategy.
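
That limit is easy to guard in CI. A small check like the following (the 4,000-character constant mirrors the documented Copilot limit; the function name and return shape are my own) can fail the build when instructions silently overflow:

```python
COPILOT_INSTRUCTION_LIMIT = 4000  # characters of instruction files Copilot reads

def check_instructions(text, limit=COPILOT_INSTRUCTION_LIMIT):
    """Report whether a custom-instructions file fits within the portion
    the review tool will actually read, and how much would be ignored."""
    ignored = max(0, len(text) - limit)
    return {"ok": ignored == 0, "ignored_chars": ignored}
```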

Bias toward one model family’s instincts

For high-risk repositories, I think a second verifier from a different model family can be a reasonable advanced pattern. I would not use this everywhere because cost and latency go up, but it is a good option when false positives or false negatives are expensive.

What I Would Measure in Production

If you want the system to improve instead of drift, track review quality like a product.

The metrics I would care about most are:

  1. implementation rate: what share of comments led to code changes
  2. dismissal rate: what share of comments authors rejected or ignored
  3. repeat-comment rate: how often resolved issues reappear
  4. no-comment rate: how often the system correctly stays silent
  5. time to first useful feedback: not just time to first output
  6. missed-issue rate against sampled human-reviewed PRs
  7. precision by category: correctness, security, tests, maintainability, readability

If I had to pick one north-star metric, it would be:

high-confidence comment implementation rate, paired with low repeat-comment rate.

That gets much closer to trust than raw volume metrics.
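
All of these rates fall out mechanically once comment outcomes are logged. A sketch, assuming each logged comment carries three boolean outcome flags (the field names are mine):

```python
def review_metrics(comment_log):
    """Compute core trust metrics from logged comment outcomes. Each entry
    is a dict with 'implemented', 'dismissed', and 'repeat' booleans."""
    n = len(comment_log)
    if n == 0:
        return {"implementation_rate": 0.0, "dismissal_rate": 0.0, "repeat_rate": 0.0}
    return {
        "implementation_rate": sum(c["implemented"] for c in comment_log) / n,
        "dismissal_rate": sum(c["dismissed"] for c in comment_log) / n,
        "repeat_rate": sum(c["repeat"] for c in comment_log) / n,
    }
```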

A Good Reading Flow for Engineers: Zero to Advanced

For engineers going from zero to advanced, this is the sequence I would recommend for reading this article itself.

1. Zero to foundational

Read these first:

  • What Code Review Is Actually For
  • What the Latest Evidence Says
  • Why Existing Code Review Bots Often Feel Bad

Why this order:

  • first anchor the purpose of review
  • then calibrate expectations with actual benchmark and field evidence
  • then connect that evidence to the pain engineers already feel

2. Builder level

Then move to:

  • What Makes a Good LLM for Code Review
  • How Good Code vs Bad Code Changes the Review Itself
  • Common Failure Modes I Would Design Against

Why this order:

  • model choice without task understanding is premature
  • review quality depends on PR shape and code clarity, not only the model
  • once those two are clear, the failure modes become obvious

3. Advanced operator level

Finish with:

  • A Better Agentic Design
  • A Practical Comment Schema
  • How Fast Should an Agentic Reviewer Be?
  • How to Make Comments Less Biased and More Correct
  • What I Would Measure in Production

Why this order:

  • architecture comes before optimization
  • output design comes before latency tuning
  • bias and correctness controls come before scaling
  • metrics come last because you should measure the system you actually chose to build

If someone is already deep in agent evaluation, they can safely skim the early framing sections and jump straight into the benchmark section, design section, and production metrics section.

The Core Takeaway

The best code review agent is not a fake senior engineer.

It is a narrow, disciplined system that:

  • knows what kind of review it is doing
  • looks at only the context it actually needs
  • generates candidate critiques
  • verifies those critiques against evidence
  • ranks them by importance
  • and often decides not to comment

The latest research does not support the idea that current LLMs can replace expert human review on realistic pull requests.

It does support a more useful conclusion:

AI code review is valuable when it behaves like a careful first-pass critic, not an always-on oracle.

That means:

  • trust precision more than coverage
  • trust implementation rate more than comment count
  • trust grounded evidence more than polished language
  • trust silence more than low-confidence noise

If you build for those traits, the system has a chance to become a real teammate.

If you do not, it will look busy while making code review harder.

Sources

  1. SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback — March 27, 2026 benchmark showing eight frontier models catch only 15% to 31% of human-flagged issues on realistic PR review, and degrade with more context.
  2. CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects — Repository-level benchmark concluding that no single LLM dominates all aspects of code review, while Gemini 2.5 Pro had the strongest comprehensive performance on that setup.
  3. Automated Code Review In Practice — Industrial study across 4,335 pull requests showing useful comments, but also longer PR closure times and complaints about faulty or irrelevant feedback.
  4. Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions — Large-scale study of 22,000+ AI review comments showing that concise, manually triggered, snippet-rich, hunk-level comments are more likely to lead to code changes.
  5. HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation — Hallucination-filtering approach for code review comments, evaluated on enterprise projects with strong cost-effectiveness.
  6. Finding GPT-4’s mistakes with GPT-4 — OpenAI’s CriticGPT write-up showing that critique-specialized assistance helped humans catch more problems and reduced nitpicks and hallucinated bugs versus weaker critique baselines.
  7. Using GitHub Copilot code review — Current GitHub documentation describing Copilot review behavior, including comment reviews, manual re-review, repeated-comment limitations, instruction-file limits, and sub-30-second uncommitted review expectations. Accessed April 15, 2026.
  8. Why we chose Anthropic’s Claude to power Graphite Reviewer — Vendor case study reporting faster feedback loops and high implementation and satisfaction rates when comment quality is tuned well.
  9. Understanding Code Understandability Improvements in Code Reviews — Study showing that a large share of real review comments aim to improve understandability, and that those changes are usually accepted.
  10. A Large-Scale Study of Modern Code Review and Security in Open Source Projects — Large-scale evidence that stronger code review coverage is associated with fewer quality and security issues.
  11. How to write code review comments — Google’s reviewer guidance on clarity, explaining why, and labeling comment severity.
  12. Small CLs — Google’s argument for small, coherent review units that are faster and easier to review well.
  13. Is AI coding making pull requests harder to review? — Recent community discussion capturing engineer concerns about plausible-but-bloated AI-generated PRs and review overload.