CLAIM-FIDELITY AUDITS · AGENTIC TOOLS

We audit claims — we don't review tools.

Every dossier scores a tool's material self-claims for fidelity: claimed → observed → status → evidence. Deltas close only when we re-verify the stated falsification criterion; vendor issue-closure alone never closes a row. Read the dossiers here; your agents read them via MCP. No vendor influence. No paywalled CVEs.
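
As a rough illustration of the closure rule above, a dossier row could be modeled like this. It is a minimal sketch, not Theory Delta's actual schema: the `ClaimRow` class, its field names, and the `"open"` / `"closed"` status strings are all hypothetical, chosen only to mirror the claimed → observed → status → evidence chain.

```python
from dataclasses import dataclass, field


@dataclass
class ClaimRow:
    """One audited self-claim (hypothetical shape, not the published schema)."""
    claimed: str                  # what the vendor says the tool does
    observed: str                 # what was actually seen
    falsification_criterion: str  # the check that would show the delta is gone
    evidence: list[str] = field(default_factory=list)  # source URLs or notes
    status: str = "open"          # assumed labels: "open" or "closed"

    def close(self, criterion_reverified: bool,
              vendor_issue_closed: bool = False) -> bool:
        """Close the delta only when the criterion has been re-verified.

        vendor_issue_closed is accepted but deliberately ignored:
        a closed vendor issue never changes the status by itself.
        """
        if criterion_reverified:
            self.status = "closed"
            return True
        return False
```

Under these assumptions, calling `close(criterion_reverified=False, vendor_issue_closed=True)` leaves the row open, which is the behavior the paragraph describes.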

LIVE DOSSIERS

All 5 dossiers →

Hermes Agent · 2026-07-09

A fast-moving, well-connected personal agent chassis whose headline self-improvement claim is not functional in any released version.

5/14 as labeled · 6 open deltas

Open the dossier →

Claude Code · 2026-06-28

Claude Code's capability claims largely hold as labeled, but its enforcement-language claims — deny rules, hook exit-code blocking, managed-policy precedence, shell-operator awareness, and per-provider model-alias resolution — are each contradicted by confirmed silent failures, several of which the vendor closed as stale rather than fixed.

4/14 as labeled · 11 open deltas

Open the dossier →

LangGraph · 2026-06-28

LangGraph's orchestration, streaming, and adoption claims hold as labeled, but its durable-execution and persistence label rests on a checkpoint layer whose documented enum and failure handling silently corrupt state: four serialization bugs, a double-interrupt snapshot bug, and a sync/async recovery divergence were all re-verified open on 2026-06-28 against the 1.2.6 release line, after staying open across the entire v1.0.10→1.2.4 span since Theory Delta first published them on 2026-03-29.

3/12 as labeled · 14 open deltas

Open the dossier →

LiteLLM · 2026-06-28

The unified-API and provider-breadth claims hold as labeled, but the production-control claims — budgets, rate limits, fallback, and the README's 8ms-P95-at-1k-RPS figure — are contradicted by issue-traced behavior under exactly the concurrent multi-tenant conditions the proxy is marketed for, with the key fixes unmerged and the backing issues stale-bot closed unfixed.

3/13 as labeled · 11 open deltas

Open the dossier →

Ollama · 2026-06-28

Ollama's simplicity, model-library, platform, and context-default documentation claims hold as labeled, but its tool-calling label is contradicted where it matters for agents — documented streaming tool support drops tool_calls chunks silently in production, and the get-running-in-minutes agent framing inherits a default context below the vendor's own agentic guidance on common hardware.

6/11 as labeled · 6 open deltas

Open the dossier →

WHAT ARE YOU ABOUT TO DO?

I'M… · 5 findings

Setting up MCP servers

row-level security bypass, session destruction on retry, default-open configs

See what's in the way →

I'M… · 1 finding

Choosing an LLM gateway

silent failure modes — budget counters drift, fallbacks to dead providers, retry cost amplification

See what's in the way →

I'M… · 4 findings

Picking an agent framework

streaming guardrails broken, frameworks diverging, hook unreliability, checkpoint serialization loss

See what's in the way →

I'M… · 5 findings

Building a RAG pipeline

three silent failure modes, RAM ceiling data loss, structured generation thresholds, graph memory not production-ready

See what's in the way →

I'M… · 2 findings

Evaluating a benchmark

SWE-bench Verified retired, agent CI is non-deterministic

See what's in the way →

I'M… · 5 findings

Configuring agent autonomy

Claude Code hooks unreliable, settings attack surface, error suppression, default-open Goose configs, OTel trace exfiltration

See what's in the way →

FEATURED FINDING · APR 2026

All 59 findings →

The benchmark everyone cited was retired for being wrong.

YOU EXPECT

Vendor SWE-bench Verified scores reflect production reliability and the cases are valid.

WHAT HAPPENS

The benchmark's authors retired it Feb 14. 295 of 500 cases were flawed. 14 vendors still cite the inflated scores.

WHAT IT MEANS FOR YOU

Any selection decision based on a public Verified score overestimates success on real tickets by 20–30 percentage points.

WHAT TO DO

Stop citing Verified scores in selection. Replicate one of your real tickets on the corrected subset, or use SWE-bench Live.

source-reviewed · independently-confirmed · confidence: high · 17 sources · 9 gh-issues · 3 papers

Read the finding →

See the receipts ↗

WHAT THIS IS

A field guide for the agentic tool landscape — structured, opinionated knowledge about what tools actually do. Humans read it here; agents read it via MCP.

We test, we read the issue trackers, we run the tools. Then we publish what we found. Every claim is traced to a primary source or labeled as Theory Delta's own analysis. If a number doesn't come from a primary source, it doesn't appear.

BLOCKS

87 in corpus

Synthesised knowledge — claims, confidence, connections. The asset.

EVIDENCE RECORDS

142 receipts

Per-claim provenance. Source URL, what it actually says, verified date.

PUBLISHED FINDINGS

59 live

Trajectory-changing insight. What you expect, what happens, what to do.

ENGINE PROVENANCE SURFACES

Public, checkable, and linked from the field guide.

TASK → FINDING PATH

Start with what you're about to do, then trace to findings mapped to each phase.

Browse task hubs →

FINDING → RECEIPTS

Each finding ships with publication metadata, evidence type, and linked receipt sections.

Browse findings →

RECEIPTS → PRIMARY SOURCES

The featured finding exposes source-linked receipts so claims can be checked line by line.

Open featured receipts ↗

FINDING → FACT-CHECK READOUT

Fact-check sessions publish corrections and open questions so updates stay auditable.

Open latest readout →

RECENT FINDINGS

Five we shipped this month

All 59 findings →

id · tool · delta · evidence · verified

0059 · OpenAI Codex CLI · Codex's approval policy doesn't hold across runtimes -- VS Code ignores it, Windows inverts it, and CI auto-approves any mid-session escalation since v0.113.0 · empirical · 2026-07-05

0058 · Claude Desktop (Anthropic) · Claude Desktop and Claude Code send the same clientInfo.name -- MCP servers cannot tell them apart · empirical · 2026-07-04

0057 · BerriAI · LiteLLM's supply chain was compromised and budget enforcement fails silently under concurrent load · empirical · 2026-06-29

0056 · MCP Python SDK (Streamable · MCP stateful sessions fail with every free-tier load balancer -- and neither major client recovers automatically · empirical · 2026-06-21

0055 · OAuth RFC 8693 (IETF) · Multi-Agent OAuth Delegation Has No Enforcement Layer -- RFC 8693 'act' Claims Are Advisory Only · medium · 2026-06-15

FOR AGENTS

Your agent should query Theory Delta before the tool decision, not after.

Findings ship as structured JSON with confidence, evidence type, and source URLs. llms.txt and /.well-known/mcp.json are live for agent discovery.

HTTP · stable

llms.txt · live

/.well-known/mcp.json

~/.config/agent.json

{
  "mcpServers": {
    "theorydelta": {
      "type": "http",
      "url":  "https://api.theorydelta.com/mcp"
    }
  }
}

$ td query "should I use LiteLLM as a budget gateway?"

→ 1 finding · confidence:high · 11 sources

→ what to do: budgets drift; verify counter behavior or use…
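
The section above says findings ship as structured JSON with confidence, evidence type, and source URLs. The sketch below shows one way an agent might gate a tool decision on such a payload. The key names (`confidence`, `evidence_type`, `source_urls`), the sample values, and the `is_actionable` helper are all assumptions made for illustration; the real response schema is not shown on this page and may differ.

```python
import json

# Hypothetical payload: key names and values are placeholders,
# not the published finding schema.
SAMPLE = """
{
  "confidence": "high",
  "evidence_type": "empirical",
  "source_urls": ["https://example.com/source-1", "https://example.com/source-2"]
}
"""


def is_actionable(raw: str, min_sources: int = 1) -> bool:
    """Return True when a finding is high-confidence and has enough sources."""
    finding = json.loads(raw)
    return (
        finding.get("confidence") == "high"
        and len(finding.get("source_urls", [])) >= min_sources
    )
```

An agent could call `is_actionable(payload)` before committing to a tool and fall back to its own checks when the result is `False`.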