The benchmark you used to evaluate your agent was retired: 59% of its test cases were wrong
SWE-bench Verified was retired in February 2026 after 59% of its test cases were found to be flawed; its scores had overstated production reliability by 20 to 30 percentage points.