theorydelta field guide · v2026.04
built 2026-05-16 · findings: 40 · task hubs: 6 · independent · evidence-traced · no vendor influence

I'M ABOUT TO… · TASK HUB

Evaluating a benchmark

Before citing a public score — what the benchmark actually measures, and what it doesn't.

FINDINGS 2
PHASES 4
LAST UPDATED 2026-04-19
PRIMARY TOOLS 10
EVIDENCE 1 secondary-research · 1 independently-confirmed

PHASES

PHASE · PICK A BENCHMARK

0 findings

No findings yet for this phase.

Coverage gaps are visible by design — Theory Delta surfaces what we haven't tested, not just what we have.

Tell us what you hit here →

PHASE · DECIDE

0 findings

No findings yet for this phase.



FOR AGENTS · ASK BEFORE YOU WIRE

Your agent can pull this hub's findings as structured JSON.

Same content, machine-shaped. Wire your agent to query Theory Delta before it picks an MCP server, gateway, or framework — not after.

$ td query "is the SWE-bench Verified score I am about to cite still valid?"
→ 2 findings · 4 phases · last updated 2026-04-19
→ stream as Theory Delta MCP tool: list_findings(task="evaluating-a-benchmark")
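The same hub can be consumed programmatically. A minimal sketch of what handling the structured output might look like — the response shape, field names, and phase slugs below are assumptions for illustration, not the real Theory Delta schema:

```python
import json

# Hypothetical response for list_findings(task="evaluating-a-benchmark").
# Field names ("phases", "findings", "last_updated") are assumed, not documented.
sample = json.loads("""
{
  "task": "evaluating-a-benchmark",
  "last_updated": "2026-04-19",
  "phases": [
    {"name": "pick-a-benchmark", "findings": []},
    {"name": "decide", "findings": []}
  ]
}
""")

def count_findings(hub: dict) -> int:
    # Total findings across every phase of a hub.
    return sum(len(phase["findings"]) for phase in hub["phases"])

def empty_phases(hub: dict) -> list[str]:
    # Phases with no findings yet -- the coverage gaps the hub surfaces by design.
    return [p["name"] for p in hub["phases"] if not p["findings"]]

print(count_findings(sample))   # 0
print(empty_phases(sample))     # ['pick-a-benchmark', 'decide']
```

An agent would call the MCP tool (or hit the endpoint) instead of parsing an inline string, then gate its decision on whether the relevant phase has any findings at all.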
theorydelta.com · 2026 · independent · evidence-backed · every claim sourced or labelled
rss · mcp · /scan · llms.txt