theorydelta field guide
built 2026-06-01 findings: 49 task hubs: 6 independent · evidence-traced · no vendor influence

AI code review tool detection rates vary by an order of magnitude — architecture determines the ceiling, not model quality

Published: 2026-05-15 · Last verified: 2026-05-06 · Confidence: empirical
Staleness risk: high — facts in this subject area change quickly between releases. Re-check the specific claims against your own environment before acting. (This rates the topic, not whether this page is out of date.)


What you expect

AI code review tools detect bugs at a roughly consistent rate across providers. Picking one is largely a matter of pricing and workflow fit; any well-resourced vendor gets close to the performance ceiling. The Sweep-era vision of autonomous PR generation from issues matures into a standard product category alongside PR review bots.

What actually happens

Detection rate spread is 14x across tools

The Greptile July 2025 benchmark tested 5 tools against 50 real-world bugs from production codebases across Python, TypeScript, Go, Java, and Ruby repos. Results with default settings:

Tool | Overall | Critical | High | Medium+Low
Greptile | 82% | 58% | 100% | 88%
BugBot | 58% | 58% | 64% | 58%
Copilot | 54% | 50% | 57% | 55%
CodeRabbit | 44% | 33% | 36% | 55%
Graphite | 6% | 17% | 0% | 6%

The 6%–82% spread reflects different context access strategies, not model quality. Tools with full-repo context (Greptile) substantially outperform tools processing diff-only context. Larger models applied to diff-only context will not close this gap.

Greptile runs this benchmark and is itself a vendor. No independent benchmark ecosystem exists.
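The arithmetic behind the 14x figure can be checked directly from the table. The sketch below is only a restatement of the published overall rates; the implied bug counts assume each overall percentage is taken over all 50 benchmark bugs, which the source does not state explicitly.

```python
# Overall detection rates as reported in the table above (vendor-run benchmark).
OVERALL_RATES = {
    "Greptile": 0.82,
    "BugBot": 0.58,
    "Copilot": 0.54,
    "CodeRabbit": 0.44,
    "Graphite": 0.06,
}
TOTAL_BUGS = 50  # benchmark size stated in the text


def implied_bug_counts(rates, total):
    """Convert each overall rate into an implied number of bugs caught.

    Assumption: every rate is a share of the same `total` bugs.
    """
    return {tool: round(rate * total) for tool, rate in rates.items()}


def spread_ratio(rates):
    """Ratio of the highest overall rate to the lowest."""
    return max(rates.values()) / min(rates.values())


counts = implied_bug_counts(OVERALL_RATES, TOTAL_BUGS)
ratio = spread_ratio(OVERALL_RATES)

for tool, caught in counts.items():
    print(f"{tool}: {caught} of {TOTAL_BUGS}")
print(f"spread: {ratio:.1f}x")
```

Under that assumption the top tool catches 41 of 50 bugs and the bottom tool 3 of 50, a ratio of about 13.7, which the text rounds to 14x.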

Structural noise at 10:1

Practitioners report that AI review tools generate approximately 10 speculative or low-value comments for every 1 actionable finding. The reported ratio is similar across tools, but it rests on practitioner accounts rather than a published measurement. Greptile published “There is an AI Code Review Bubble” (January 24, 2026) addressing the differentiation challenge, though it does not explicitly quantify the 10:1 ratio.

The noise has a compounding effect: developers learn to ignore AI review comments, which means the 1-in-10 real issues get ignored alongside the noise.
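To put the 10:1 ratio in more familiar terms, it can be converted to a precision figure. This is a small worked calculation, not a measured result; the function name is illustrative, and the ratio itself is the practitioner estimate quoted above.

```python
from fractions import Fraction


def precision_from_noise_ratio(noise, actionable):
    """Share of all comments that are actionable, given a noise:actionable ratio."""
    return Fraction(actionable, noise + actionable)


# 10 speculative comments for every 1 actionable finding.
precision = precision_from_noise_ratio(10, 1)
print(f"precision = {precision} = {float(precision):.1%}")
```

A 10:1 ratio means roughly 1 comment in 11 is worth acting on, or about 9% precision.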

The noise is partially structural — baked into training data. arXiv:2502.02757 (MSR 2025) found LLM-based cleaning of code review training datasets achieves 66–85% precision in identifying valid comments. Models fine-tuned on cleaned data generate comments 12.4–13.0% more similar to valid human feedback than models trained on uncleaned datasets. Better prompts and larger models will not eliminate the training data floor.

Non-determinism across runs

The same PR reviewed on two separate runs produces different comment sets. This means AI code review cannot function as a reliable CI gate. It is advisory at best.
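One way to check this on your own pull requests is to run the reviewer twice and compare the two comment sets. The sketch below uses Jaccard similarity as the overlap measure; the findings and the (file, line, category) key are hypothetical, and how you normalize real comments into comparable keys is left as an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty runs agree trivially
    return len(a & b) / len(a | b)


# Hypothetical findings from two runs on the same PR, keyed by (file, line, category).
run_1 = {
    ("api.py", 42, "null-deref"),
    ("db.py", 10, "sql-injection"),
    ("ui.ts", 7, "style"),
}
run_2 = {
    ("api.py", 42, "null-deref"),
    ("cache.py", 88, "race"),
}

overlap = jaccard(run_1, run_2)
unstable = run_1 ^ run_2  # findings that appeared in only one of the two runs

print(f"overlap: {overlap:.2f}")
print(f"unstable findings: {len(unstable)}")
```

A deterministic gate would need an overlap of 1.0 on every rerun; anything lower means the pass/fail result depends on which run happened to execute.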

Sweep is dead and the category split is permanent

Sweep (sweepai/sweep) — the highest-profile autonomous PR generation tool — was abandoned mid-2024. As of May 2026: ~7.7K stars (often inflated to 30K in secondary sources), 241 open issues, last substantive commit June 2024. The team pivoted to a JetBrains IDE plugin; the GitHub App has no maintainer engagement.

Sweep’s failure exposed a structural problem: generating PRs from issue descriptions requires an interactive agent loop with execution capability and human checkpoints. Sweep was a stateless GitHub App — the opposite architecture. The tools that now fill the issue-to-PR niche (Codex, Claude Code in CI, Devin) all require interactive loops.

The “AI code review” label now covers two incompatible product categories:

PR review bots (comment on existing PRs): CodeRabbit (632K PRs reviewed in 2025, self-reported), GitHub Copilot (561K PRs, self-reported), Cursor BugBot, Greptile. These cannot generate PRs.

Autonomous coding agents (generate PRs from instructions): Codex, Claude Code + GitHub Actions, Devin. These do not review existing PRs.

No tool spans both. This split will not reconverge — the architectures are incompatible.

Multi-stage filtering is the structural answer to noise

All effective noise-reduction approaches layer multiple independent passes:

  • Cursor BugBot: 8 parallel passes with randomized diff order → majority voting → bucket merge → category filter → validator model → dedup against previous runs (V1–V11, 52% → 70% resolution rate)
  • Ellipsis: dedup filter → confidence filter → hallucination filter (cross-references against actual code)
  • CodeRabbit: path_filters + path_instructions context injection + per-repo learnings + review profile

Single-pass architectures (one LLM call, no filtering) cannot match this regardless of model quality.
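The layering idea can be sketched in a few lines. This is a minimal illustration of majority voting followed by a validator and dedup against earlier runs, loosely modeled on the stage names listed above; it is not BugBot's or any vendor's actual implementation, and the finding strings, the validator rule, and the function names are all hypothetical.

```python
from collections import Counter


def majority_vote(passes, min_votes=None):
    """Keep findings reported by a strict majority of passes (or at least min_votes)."""
    if min_votes is None:
        min_votes = len(passes) // 2 + 1
    # Count each finding at most once per pass.
    votes = Counter(finding for p in passes for finding in set(p))
    return {finding for finding, n in votes.items() if n >= min_votes}


def filter_findings(passes, previously_reported, validator):
    """Majority vote, then validator check, then dedup against earlier runs."""
    voted = majority_vote(passes)
    validated = {f for f in voted if validator(f)}
    return validated - set(previously_reported)


# Hypothetical output of 8 independent review passes over the same diff.
passes = [
    ["null-deref@api.py:42", "unused-var@util.py:3"],
    ["null-deref@api.py:42"],
    ["null-deref@api.py:42", "race@cache.py:88"],
    ["null-deref@api.py:42", "race@cache.py:88"],
    ["race@cache.py:88", "null-deref@api.py:42"],
    ["race@cache.py:88"],
    ["race@cache.py:88", "naming@ui.py:7"],
    ["null-deref@api.py:42", "naming@ui.py:7"],
]

# Stand-in for a validator model: here it only rejects naming nits.
def validator(finding):
    return not finding.startswith("naming")

already_posted = {"race@cache.py:88"}  # reported on an earlier run of this PR

surviving = filter_findings(passes, already_posted, validator)
print(surviving)
```

In this toy run, four distinct findings across eight passes collapse to one posted comment: two fall below the vote threshold and one was already reported.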

What this means for you

Which tool you choose, not whether you have AI review, determines whether you catch bugs. The 6%–82% range means a team running the lowest-scoring tool is operating with false confidence: its AI reviewer catches 6% of real bugs while generating noise at 10:1. The gap between “using AI code review” and “using AI code review that works” is roughly a 14x difference in detection rate (6% vs 82%).

You cannot use AI code review as a CI gate. Non-determinism across runs means the same bug will be flagged in one run and missed in another. It is an advisory layer, not a quality control gate.

Qodo/PR-Agent is not execution-capable at PR time. The REQUIRE_TESTS_REVIEW config flag controls review for presence of tests, not execution. CodiumAI’s test-generation capability is IDE-based at write time, not PR review time. Teams expecting Qodo to run tests and flag failures at review time will find it does not.

Resolution rate is a better metric than detection rate. Cursor BugBot’s V1-to-V11 trajectory (52% → 70%, achieved by using resolution rate as the optimization target) is the only published falsifiable metric for AI code review improvement. Teams tracking comment volume or comment acceptance are optimizing a proxy that doesn’t predict whether bugs get fixed.
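As a rough sketch of the metric, resolution rate is the share of flagged comments whose issue was actually fixed in the merged code. The record type and its field names below are hypothetical, and the sketch assumes the resolved/unresolved label is supplied by whatever judge you use at merge time; it does not reproduce Cursor's scoring method.

```python
from dataclasses import dataclass


@dataclass
class FlaggedComment:
    comment_id: str
    resolved_at_merge: bool  # hypothetical label: was the flagged issue fixed in the merged code?


def resolution_rate(comments):
    """Fraction of flagged comments that were resolved by merge time."""
    if not comments:
        return 0.0
    return sum(c.resolved_at_merge for c in comments) / len(comments)


# Hypothetical sample: 10 flagged comments, 7 of them fixed before merge.
comments = [FlaggedComment(f"c{i}", i < 7) for i in range(10)]
rate = resolution_rate(comments)
print(f"resolution rate: {rate:.0%}")
```

Unlike comment volume, this number only rises when engineers change code in response to a comment.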

What to do

  1. Identify which category your tool is in before evaluating it. PR review bots (CodeRabbit, Copilot, BugBot, Greptile) comment on existing PRs. Coding agents (Codex, Claude Code + CI, Devin) generate PRs from specs. Using a review bot to replace a coding agent, or vice versa, is a category error.

  2. Choose tools with full-repo context access if detection rate matters. The best-scoring diff-only tools reach 54–58% in the Greptile benchmark; the full-repo context tool reaches 82% in the same benchmark. The gap is architectural; upgrading models won’t close it.

  3. Enable multi-stage filtering or configure noise thresholds aggressively. Single-pass tools generate 10 speculative comments per real issue. Tools with majority voting + validator + dedup (BugBot model) reduce noise structurally. For CodeRabbit: configure path_filters and path_instructions to focus scope. Unfiltered AI review is worse than no AI review — developers learn to ignore the channel.

  4. Do not use AI code review as a CI gate. Non-determinism across runs means the same bug will be flagged in one run and missed in another. Treat it as an advisory signal, not a quality control checkpoint.

  5. Track resolution rate at merge time, not comment volume. “Did engineers actually fix what was flagged?” — judged by LLM at merge — is the only metric that maps to real bug prevention. Comment acceptance rate, reaction counts, and detection rate are all proxies that can be optimized independently of bug reduction.

  6. For Qodo/PR-Agent users: REQUIRE_TESTS_REVIEW does not run tests. It controls review for presence of tests in the diff. If you need PR-time test execution, you need a separate CI step, not a Qodo config flag.
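Steps 4 and 6 point to the same arrangement: deterministic checks, such as a real test run, decide the build result, and AI review output is reported without affecting it. A minimal sketch of that split follows; the function name and inputs are hypothetical and not tied to any particular CI system or review tool.

```python
def ci_exit_code(deterministic_failures, ai_findings):
    """Return the process exit code for a CI step.

    Only deterministic failures (for example, failing tests) can fail the
    build. AI review findings are printed as advisory output and never gate.
    """
    for finding in ai_findings:
        print(f"[advisory] {finding}")
    for failure in deterministic_failures:
        print(f"[blocking] {failure}")
    return 1 if deterministic_failures else 0


# AI findings alone do not fail the build.
advisory_only = ci_exit_code([], ["possible null dereference in api.py"])

# A failing test does, regardless of what the AI reviewer said.
with_failure = ci_exit_code(["test_login failed"], [])

print(advisory_only, with_failure)
```

Because the exit code depends only on the deterministic checks, rerunning the AI reviewer cannot flip a build from pass to fail.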

Falsification criterion: This finding would be disproved by a publicly verified independent benchmark (not run by a vendor) showing two or more AI code review tools with different context architectures achieving comparable detection rates across diverse codebases, demonstrating that architecture differences do not drive the observed spread.

Evidence

Tool | Version | Evidence | Result
Greptile | July 2025 benchmark | source-reviewed | 82% overall detection on 50 real-world bugs; highest of 5 tools (vendor benchmark)
BugBot (Cursor) | V1 (July 2025) – V11 (Jan 2026) | source-reviewed | Resolution rate 52% → 70% using resolution rate as optimization target; 40 major experiments across V1–V11
GitHub Copilot code review | GA April 2025 | docs-reviewed | Comment-only reviews; no Approve/Request Changes; no auto-merge capability; 54% detection in Greptile benchmark
CodeRabbit | July 2025 benchmark | source-reviewed | 44% overall detection; 632K PRs reviewed in 2025 (self-reported)
Graphite | July 2025 benchmark | source-reviewed | 6% overall detection; lowest of 5 tools tested
Sweep (sweepai/sweep) | Reviewed May 2026 | source-reviewed | ~7.7K stars, 241 open issues, GitHub App abandoned mid-2024; pivoted to JetBrains plugin
arXiv:2502.02757 | MSR 2025 | source-reviewed | LLM cleaning of code review training data: 66–85% precision; cleaned-data models 12.4–13.0% closer to valid human feedback
Qodo/PR-Agent config | Reviewed 2026-05 | source-reviewed | REQUIRE_TESTS_REVIEW is a boolean under [pr_reviewer]; controls test presence analysis, not test execution

Confidence: empirical — 5 tools and 8 sources reviewed. No independent benchmark exists for the primary detection rate claims; the Greptile benchmark is vendor-run. The 10:1 noise ratio is practitioner consensus without an explicit published source. The training dataset finding (arXiv:2502.02757) is independently published.

Strongest case against: The entire detection rate spread may reflect benchmark design artifacts rather than real-world differences. Greptile’s benchmark uses 50 bugs from 5 repos — a narrow sample that may favor Greptile’s full-repo context approach for the bug types selected. Tools optimized for different bug classes (security vs. style vs. logic) might rank differently on a more diverse benchmark. The absence of an independent benchmark makes this impossible to rule out.

Open questions: Would an independent benchmark (not run by a vendor) confirm the 6–82% spread, or narrow it? Does the noise ratio vary systematically with codebase size, language, or tool configuration? Has BugBot’s post-V11 resolution rate (with learned rules, April 2026) improved beyond the 70% reported at V11?

