theorydelta field guide · v2026.04
built 2026-05-16 · findings: 40 · task hubs: 6 · independent · evidence-traced · no vendor influence

I'M ABOUT TO… · TASK HUB

Evaluating a benchmark

Before citing a public score — what the benchmark actually measures, and what it doesn't.

FINDINGS 2
PHASES 4
LAST UPDATED 2026-04-19
PRIMARY TOOLS 10
EVIDENCE 1 secondary-research · 1 independently-confirmed

PHASES

PHASE · PICK A BENCHMARK

0 findings

No findings yet for this phase.

Coverage gaps are visible by design — Theory Delta surfaces what we haven't tested, not just what we have.

Tell us what you hit here →

PHASE · DECIDE

0 findings

No findings yet for this phase.



FOR AGENTS · ASK BEFORE YOU WIRE

Your agent can pull this hub's findings as structured JSON.

Same content, machine-shaped. Wire your agent to query Theory Delta before it picks an MCP server, gateway, or framework — not after.

$ td query "is the SWE-bench Verified score I am about to cite still valid?"
→ 2 findings · 4 phases · last updated 2026-04-19
→ stream as Theory Delta MCP tool: list_findings(task="evaluating-a-benchmark")
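The same hub can be consumed programmatically. A minimal sketch of what handling the structured output might look like — the response shape, field names, and phase slugs below are assumptions for illustration, not the real Theory Delta schema:

```python
import json

# Hypothetical response for list_findings(task="evaluating-a-benchmark").
# Field names ("phases", "findings", "last_updated") are assumed, not documented.
sample = json.loads("""
{
  "task": "evaluating-a-benchmark",
  "last_updated": "2026-04-19",
  "phases": [
    {"name": "pick-a-benchmark", "findings": []},
    {"name": "decide", "findings": []}
  ]
}
""")

def count_findings(hub: dict) -> int:
    # Total findings across every phase of a hub.
    return sum(len(phase["findings"]) for phase in hub["phases"])

def empty_phases(hub: dict) -> list[str]:
    # Phases with no findings yet -- the coverage gaps the hub surfaces by design.
    return [p["name"] for p in hub["phases"] if not p["findings"]]

print(count_findings(sample))   # 0
print(empty_phases(sample))     # ['pick-a-benchmark', 'decide']
```

An agent would call the MCP tool (or hit the endpoint) instead of parsing an inline string, then gate its decision on whether the relevant phase has any findings at all.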
theorydelta.com · 2026 · independent · evidence-backed · every claim sourced or labelled
rss · mcp · /scan · llms.txt