theorydelta field guide
built 2026-06-01 findings: 49 task hubs: 6 independent · evidence-traced · no vendor influence

FACT-CHECK SESSION · swe-bench-retired-benchmark-gap · 2026-04-26

Two evidence-record corrections on the Ragas citation; the retirement claim and the SWE-bench Pro replacement still hold.

This session re-verified the citations in the SWE-bench retirement finding against current evidence. The retirement of SWE-bench Verified, the 59.4% flaw rate, and the SWE-bench Pro replacement claim all still hold against current sources. Two corrections were confirmed on the Ragas citation — the cited issue number is unrelated to the empty-context faithfulness bug, and the repository was transferred from explodinggradients/ragas to vibrantlabsai/ragas. Both corrections route to the evidence record. Two open questions were filed as signals for the next discovery cycle.

Claims checked: 10
Confirmed corrections: 2
Open questions: 2
Block validated: 2026-04-18 (finding verified 2026-04-18)
Gap (days): 0 (block → finding latency)

The two corrections above are the durable artifact of this session. The narrative below records reviewer notes that didn’t fit the structured fields.

Reviewer notes

The retirement of SWE-bench Verified is independently confirmed by OpenAI’s own announcement page; the 59.4% flaw rate and the SWE-bench Pro replacement claim both still hold. The frontier-score table (79.2%, Sonar Foundation Agent + Claude 4.5 Opus) was not re-verified this session — leaderboards drift and the table will be re-checked on the next cycle.

The Ragas corrections came from running gh issue view against the cited issue number, then noticing that the GitHub API returned a different canonical repository name from the one cited in the finding. Both corrections route to the evidence record rather than to the synthesis agent. The underlying claim about the empty-context faithfulness bug is plausible and worth tracking down, but the citation that backs it must be re-attached before the claim can again be cited as independently confirmed.
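The check described above can be approximated against the GitHub REST API. The following is a minimal sketch, not the tooling used in this session: the function names are hypothetical, and it assumes the repository endpoint's full_name field reflects the current owner after a transfer and that pull requests returned by the issues endpoint carry a pull_request key.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_json(path):
    """GET a GitHub REST API path; urllib follows the redirect a transferred repo returns."""
    request = urllib.request.Request(
        API_ROOT + path,
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def canonical_name_drift(cited_full_name, repo_payload):
    """Return the current owner/repo name if it differs from the cited one, else None."""
    current = repo_payload.get("full_name", "")
    if current and current.lower() != cited_full_name.lower():
        return current
    return None


def describe_issue(issue_payload):
    """Summarise an issue payload; pull requests carry a 'pull_request' key."""
    kind = "pull request" if "pull_request" in issue_payload else "issue"
    return {"kind": kind, "title": issue_payload.get("title", "")}


def check_citation(cited_full_name, issue_number):
    """Hypothetical helper: needs network access and is subject to API rate limits."""
    repo = fetch_json(f"/repos/{cited_full_name}")
    current_name = repo.get("full_name", cited_full_name)
    issue = fetch_json(f"/repos/{current_name}/issues/{issue_number}")
    report = describe_issue(issue)
    report["moved_to"] = canonical_name_drift(cited_full_name, repo)
    return report
```

Calling check_citation("explodinggradients/ragas", 2248) would exercise both checks in one pass. The two pure functions are kept separate from the network call so they can be run against saved API responses.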

The two open questions are filed as signals — the first asks the discovery loop to locate the actual issue under the new repository, the second escalates the org-rename to a corpus-wide URL audit.

Corrections

Correction 1
  was:    Ragas faithfulness metric returns 1.0 with empty retrieval context (see issue #2248)
  now:    Ragas issue #2248 is an unrelated docs/og-image PR (merged 2025-09-04). The actual empty-context faithfulness bug needs to be re-located under the current repository before the citation can be re-attached.
  type:   evidence-replaced
  routed: block · evidence record

Correction 2
  was:    Repository: github.com/explodinggradients/ragas
  now:    Repository transferred to github.com/vibrantlabsai/ragas (verified via GitHub API on 2026-04-26). Old URLs redirect today; the canonical org name has changed.
  type:   evidence-replaced
  routed: block · evidence record

Open questions

  • Which Ragas issue (under vibrantlabsai/ragas) tracks the faithfulness metric returning 1.0 with empty retrieval context? The number cited in the published finding (#2248) is unrelated.

    e7f55ad4-862a-4407-8c43-f7642e2c48c1 → signals · question intake
  • Theory Delta findings citing explodinggradients/ragas need URL updates. How many other findings are affected, and is the redirect-to-vibrantlabsai stable enough to defer the rewrite?

    0053d2ad-4f93-44e8-8812-fbd8570c76f0 → signals · question intake
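The corpus-wide audit raised in the second question could start from a plain text scan. This is a rough sketch under stated assumptions: findings are taken to be available as a mapping from a finding identifier to its text, which is a hypothetical structure and not the actual Theory Delta storage format, and the function names are illustrative.

```python
import re

OLD_SLUG = "explodinggradients/ragas"
NEW_SLUG = "vibrantlabsai/ragas"


def slug_pattern(slug):
    """Match the owner/repo slug, but not longer names such as 'ragas-docs'."""
    return re.compile(r"(?<![\w-])" + re.escape(slug) + r"(?![\w-])", re.IGNORECASE)


def audit_corpus(findings, old_slug=OLD_SLUG):
    """Return {finding_id: hit_count} for every finding that cites the old slug."""
    pattern = slug_pattern(old_slug)
    counts = {}
    for finding_id, text in findings.items():
        hits = len(pattern.findall(text))
        if hits:
            counts[finding_id] = hits
    return counts


def rewrite_citations(text, old_slug=OLD_SLUG, new_slug=NEW_SLUG):
    """Replace each occurrence of the old slug with the new one, leaving paths intact."""
    return slug_pattern(old_slug).sub(lambda _match: new_slug, text)
```

Running audit_corpus first gives the count of affected findings without changing anything, so the rewrite can be deferred for as long as the redirect holds.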