theorydelta field guide
built 2026-06-01 · findings: 49 · task hubs: 6 · independent · evidence-traced · no vendor influence

FIELD GUIDE · AGENTIC TOOL LANDSCAPE

What agentic tools actually do — not what their docs claim.

Empirical intelligence for builders. We test what tools claim against what they actually do, then publish it — every claim traced to a primary source. Read it here; your agents read it via MCP. No vendor influence. No paywalled CVEs.

WHAT ARE YOU ABOUT TO DO?

FEATURED FINDING · APR 2026

All 49 findings →

The benchmark everyone cited was retired for being wrong.

YOU EXPECT
Vendor-reported SWE-bench Verified scores reflect production reliability, and the benchmark cases are valid.
WHAT HAPPENS
The benchmark's authors retired it Feb 14. 295 of 500 cases were flawed. 14 vendors still cite the inflated scores.
WHAT IT MEANS FOR YOU
Any selection decision based on a public Verified score overestimates success on real tickets by 20–30 percentage points.
WHAT TO DO
Stop citing Verified scores in selection. Replicate one of your real tickets on the corrected subset, or use SWE-bench Live.
source-reviewed · independently-confirmed · confidence: high · 17 sources · 9 gh-issues · 3 papers
Read the finding → · See the receipts ↗
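
The arithmetic behind this finding can be sketched in a few lines. The case counts and the 20 to 30 point range come from the finding above; the 70% vendor score is a made-up illustration, and subtracting the range from a headline score is only a rough reading of the finding, not a published correction method.

```python
# Figures quoted in the finding above.
FLAWED_CASES = 295
TOTAL_CASES = 500
OVERESTIMATE_PP = (20.0, 30.0)  # percentage points: low end, high end


def flawed_share(flawed: int, total: int) -> float:
    """Fraction of benchmark cases that were flawed."""
    return flawed / total


def discounted_range(vendor_score_pct: float,
                     overestimate_pp: tuple = OVERESTIMATE_PP) -> tuple:
    """Rough (pessimistic, optimistic) real-ticket success, floored at 0.

    Assumption: the overestimate is simply subtracted from the headline score.
    """
    low_pp, high_pp = overestimate_pp
    pessimistic = max(vendor_score_pct - high_pp, 0.0)
    optimistic = max(vendor_score_pct - low_pp, 0.0)
    return (pessimistic, optimistic)


if __name__ == "__main__":
    print(f"flawed share: {flawed_share(FLAWED_CASES, TOTAL_CASES):.0%}")
    # Hypothetical vendor score of 70%, for illustration only.
    print(discounted_range(70.0))
```

Under these assumptions, a headline 70% would translate to somewhere between 40% and 50% on real tickets, which is why the finding recommends replicating one of your own tickets instead.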

WHAT THIS IS

A field guide for the agentic tool landscape — structured, opinionated knowledge about what tools actually do. Humans read it here; agents read it via MCP.

We test, we read the issue trackers, we run the tools. Then we publish what we found. Every claim is traced to a primary source or labelled as Theory Delta's own analysis. If a number doesn't come from a primary source, it doesn't appear.

BLOCKS
87 in corpus
Synthesised knowledge — claims, confidence, connections. The asset.
EVIDENCE RECORDS
142 receipts
Per-claim provenance. Source URL, what it actually says, verified date.
PUBLISHED FINDINGS
49 live
Trajectory-changing insight. What you expect, what happens, what to do.
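
One way to picture how receipts relate to findings is the sketch below. It is not Theory Delta's actual schema: the class and field names are hypothetical, chosen only to mirror the descriptions above (source URL, what the source says, verified date; what you expect, what happens, what to do).

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class EvidenceRecord:
    """One receipt: provenance for a single claim (hypothetical shape)."""
    source_url: str
    what_it_says: str   # what the source actually states
    verified_on: date   # date the source was last checked


@dataclass
class Finding:
    """A published finding in the expect / happens / do shape (hypothetical)."""
    title: str
    you_expect: str
    what_happens: str
    what_to_do: str
    receipts: list = field(default_factory=list)

    def is_traced(self) -> bool:
        """True when at least one receipt backs the finding."""
        return len(self.receipts) > 0


if __name__ == "__main__":
    receipt = EvidenceRecord(
        source_url="https://example.com/placeholder-source",  # placeholder URL
        what_it_says="Placeholder summary of the source.",
        verified_on=date(2026, 4, 1),  # illustrative date
    )
    finding = Finding(
        title="The benchmark everyone cited was retired for being wrong.",
        you_expect="Verified scores reflect production reliability.",
        what_happens="The benchmark was retired; 295 of 500 cases were flawed.",
        what_to_do="Stop citing Verified scores in selection.",
        receipts=[receipt],
    )
    print(finding.is_traced())
```

Keeping receipts as separate immutable records reflects the rule stated above: a claim without a traceable source should not be publishable.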

RECENT FINDINGS

Five we shipped this month

FOR AGENTS

Your agent should query Theory Delta before the tool decision, not after.

Findings ship as structured JSON with confidence, evidence type, and source URLs. llms.txt and /.well-known/mcp.json are live for agent discovery.
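
As a rough illustration of how an agent might consume such a payload, the sketch below parses one finding and checks it against a confidence bar before a tool decision. The key names (`confidence`, `evidence_type`, `sources`) and the low/medium/high scale are assumptions made for this example; the real schema may differ.

```python
import json

# Assumed ordering; only "high" is confirmed by the page.
CONFIDENCE_ORDER = {"low": 0, "medium": 1, "high": 2}
REQUIRED_KEYS = ("confidence", "evidence_type", "sources")  # hypothetical keys


def parse_finding(raw: str) -> dict:
    """Parse a finding payload and check that the assumed keys are present."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"finding is missing fields: {missing}")
    return data


def meets_bar(finding: dict, minimum: str = "high") -> bool:
    """True if the finding is confident enough and cites at least one source."""
    confident = CONFIDENCE_ORDER[finding["confidence"]] >= CONFIDENCE_ORDER[minimum]
    return confident and len(finding["sources"]) > 0


if __name__ == "__main__":
    # Illustrative payload with a placeholder URL, not real API output.
    sample = json.dumps({
        "confidence": "high",
        "evidence_type": "source-reviewed",
        "sources": ["https://example.com/placeholder"],
    })
    print(meets_bar(parse_finding(sample)))
```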

HTTP · stable | llms.txt · live | /.well-known/mcp.json
~/.config/agent.json
{
  "mcpServers": {
    "theorydelta": {
      "type": "http",
      "url":  "https://api.theorydelta.com/mcp"
    }
  }
}
$ td query "should I use LiteLLM as a budget gateway?"
→ 1 finding · confidence:high · 11 sources
→ what to do: budgets drift; verify counter behavior or use…
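
Before pointing an agent at the server, it can help to sanity-check the config entry shown above. The sketch below embeds that same JSON and extracts the server URL; the helper name `mcp_server_url` and the HTTPS requirement are choices made for this example, not rules stated by Theory Delta.

```python
import json
from urllib.parse import urlparse

# The config entry from above, embedded so the example is self-contained.
CONFIG_TEXT = """
{
  "mcpServers": {
    "theorydelta": {
      "type": "http",
      "url": "https://api.theorydelta.com/mcp"
    }
  }
}
"""


def mcp_server_url(config_text: str, name: str) -> str:
    """Return the URL for the named MCP server after basic checks."""
    config = json.loads(config_text)
    server = config["mcpServers"][name]  # KeyError if the entry is absent
    if server.get("type") != "http":
        raise ValueError(f"unexpected transport type: {server.get('type')!r}")
    parsed = urlparse(server["url"])
    # Requiring HTTPS is this sketch's own precaution.
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"not an HTTPS URL: {server['url']!r}")
    return server["url"]


if __name__ == "__main__":
    print(mcp_server_url(CONFIG_TEXT, "theorydelta"))
```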
theorydelta.com · 2026 · independent · evidence-backed · every claim sourced or labelled