FACT-CHECK SESSION · swe-bench-retired-benchmark-gap · 2026-04-26
Two evidence-record corrections on the Ragas citation; the retirement claim and the SWE-bench Pro replacement still hold.
This session re-verified the citations in the SWE-bench retirement finding against current sources. The retirement of SWE-bench Verified, the 59.4% flaw rate, and the SWE-bench Pro replacement claim all still hold. Two corrections were confirmed on the Ragas citation: the cited issue number is unrelated to the empty-context faithfulness bug, and the repository was transferred from explodinggradients/ragas to vibrantlabsai/ragas. Both corrections route to the evidence record. Two open questions were filed as signals for the next discovery cycle.
The two corrections above are the durable artifact of this session. The narrative below records reviewer notes that didn’t fit the structured fields.
Reviewer notes
The retirement of SWE-bench Verified is independently confirmed by OpenAI’s own announcement page; the 59.4% flaw rate and the SWE-bench Pro replacement claim both still hold. The frontier-score table (79.2%, Sonar Foundation Agent + Claude 4.5 Opus) was not re-verified this session — leaderboards drift and the table will be re-checked on the next cycle.
The Ragas corrections came from running gh issue view against the cited issue number, then noticing that the GitHub API returned a different canonical repository name than the one cited in the finding. Both corrections route to the evidence record rather than the synthesis-agent: the underlying claim about the empty-context faithfulness bug is plausible and worth tracking down, but the citation that backs it must be re-attached before the claim can be cited again as independently confirmed.
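The repository-name check described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the script used in this session: the function names are hypothetical, the lookup uses the public GitHub REST endpoint GET /repos/{owner}/{repo} (whose response includes a full_name field), and it assumes GitHub is still serving the redirect for the transferred repository.

```python
import json
import urllib.request
from typing import Optional


def fetch_repo_metadata(cited_full_name: str) -> dict:
    """Fetch repository metadata from the GitHub REST API.

    urllib follows redirects for GET requests by default, so a
    transferred repository should resolve to its current location
    (assumption: the redirect is still being served).
    """
    url = f"https://api.github.com/repos/{cited_full_name}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def canonical_name_mismatch(cited_full_name: str, metadata: dict) -> Optional[str]:
    """Return the canonical owner/repo name if it differs from the cited one.

    Returns None when the names match or the payload has no full_name.
    GitHub treats owner and repo names case-insensitively, so the
    comparison ignores case.
    """
    canonical = metadata.get("full_name", "")
    if canonical and canonical.lower() != cited_full_name.lower():
        return canonical
    return None
```

In use, canonical_name_mismatch("explodinggradients/ragas", fetch_repo_metadata("explodinggradients/ragas")) would flag a citation whose repository has moved. Keeping the comparison separate from the network call lets the check run against saved API payloads as well.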
The two open questions are filed as signals — the first asks the discovery loop to locate the actual issue under the new repository, the second escalates the org-rename to a corpus-wide URL audit.
Corrections
Open questions
- Which Ragas issue (under vibrantlabsai/ragas) tracks the faithfulness metric returning 1.0 with empty retrieval context? The number cited in the published finding (#2248) is unrelated.
- Theory Delta findings citing explodinggradients/ragas need URL updates. How many other findings are affected, and is the redirect-to-vibrantlabsai stable enough to defer the rewrite?
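A first pass at the "how many findings are affected" question could look like the sketch below. It is only an illustration: how findings are stored is not described here, so the example assumes a hypothetical in-memory mapping from finding id to finding text, and the function name is made up for this example.

```python
import re

OLD_SLUG = "explodinggradients/ragas"


def count_stale_references(findings: dict, old_slug: str = OLD_SLUG) -> dict:
    """Count GitHub URLs that still point at the old owner/repo slug.

    `findings` is a hypothetical mapping of finding id -> finding text.
    The negative lookahead avoids matching sibling repositories such as
    explodinggradients/ragas-extras.
    """
    pattern = re.compile(
        r"github\.com/" + re.escape(old_slug) + r"(?![\w-])",
        re.IGNORECASE,
    )
    counts = {}
    for finding_id, text in findings.items():
        hits = len(pattern.findall(text))
        if hits:
            counts[finding_id] = hits
    return counts
```

The result maps each affected finding id to its number of stale URLs, so len(count_stale_references(findings)) gives the count of findings that need a rewrite. It does not answer whether the redirect is stable; that still needs a separate check.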