About Theory Delta

Theory Delta investigates and tests the claims made by AI developer tools. It exists because the gap between what agentic tools claim and what they actually do is large, consequential, and mostly undocumented.

The problem

Agentic tools move faster than documentation can track. New capabilities appear, old assumptions break, and tool authors have every incentive to describe things as they should work rather than as they do work. The result: builders hit production failures that were discoverable, had anyone tested the claims before deployment.

Theory Delta maps the delta between documentation and behavior — not by summarising READMEs, but by studying tools in context and publishing empirical findings about what they actually do.

How findings are made

Each finding traces to a structured knowledge block: a curated set of empirical claims, each tagged with an evidence type. The evidence types are precise, not interchangeable:

runtime-tested — observed in Theory Delta's own execution environment. We ran it.
source-verified — traced to public issue trackers, reproduction steps, or confirmed bug reports. The receipts are public.
independently-confirmed — corroborated by a third party outside the tool author's organisation.
docs-reviewed — drawn from official documentation. Stated as such, not inflated to a stronger claim type.

Finding copy derives from the structured block, and the evidence type determines which claims the copy is permitted to make. A source-verified finding cannot say "we tested it." A docs-reviewed finding cannot say "we confirmed." The vocabulary constraint is enforced by the publishing pipeline, not left to editorial judgment.
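To make the vocabulary constraint concrete, here is a minimal sketch of how such a check could look. It is not the actual publishing pipeline: the names EvidenceType, FORBIDDEN_PHRASES, and voice_violations are hypothetical, and the phrase table encodes only the two rules stated above.

```python
from enum import Enum


class EvidenceType(Enum):
    RUNTIME_TESTED = "runtime-tested"
    SOURCE_VERIFIED = "source-verified"
    INDEPENDENTLY_CONFIRMED = "independently-confirmed"
    DOCS_REVIEWED = "docs-reviewed"


# Hypothetical phrase table. Only the two rules named in the text are
# encoded; a real pipeline would presumably carry a longer list.
FORBIDDEN_PHRASES = {
    EvidenceType.SOURCE_VERIFIED: ("we tested it",),
    EvidenceType.DOCS_REVIEWED: ("we confirmed",),
}


def voice_violations(copy: str, evidence: EvidenceType) -> list[str]:
    """Return the forbidden phrases that appear in the finding copy."""
    lowered = copy.lower()
    return [p for p in FORBIDDEN_PHRASES.get(evidence, ()) if p in lowered]
```

Under these assumptions, copy that says "We tested it" would be flagged for a source-verified finding and would pass for a runtime-tested one. A publishing step could refuse to ship any finding for which the returned list is non-empty.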

Before publication, every finding goes through a fact-check pass against current external sources. Claims that were accurate at block-write time but have since changed are corrected before they go out. The last_verified date on a finding reflects the fact-check date, not when the block was originally written.
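The date rule can be illustrated with a small sketch. Only last_verified comes from the text above. The Finding record, the block_written field, and the record_fact_check function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Finding:
    claim: str
    block_written: date  # when the knowledge block was authored (assumed field)
    last_verified: date  # date of the most recent fact-check pass


def record_fact_check(
    finding: Finding,
    checked_on: date,
    corrected_claim: Optional[str] = None,
) -> Finding:
    """Return a copy stamped with the fact-check date.

    If the claim has changed since the block was written, the corrected
    wording replaces it. block_written is never touched.
    """
    return replace(
        finding,
        claim=corrected_claim if corrected_claim is not None else finding.claim,
        last_verified=checked_on,
    )
```

In this sketch the authoring date and the verification date are kept as separate fields, so a reader can tell how old a block is without mistaking that for how recently the claim was checked.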

Confidence levels

Three confidence levels, no inflation:

empirical — directly observed or traced to a confirmed reproduction. The claim is specific, the evidence is on record.
medium — supported by public evidence but not runtime-tested by Theory Delta. Independently confirmed claims raise the floor here.
low — indicative. Worth tracking but not yet confirmed at a level that supports a strong action recommendation.

Every finding states a falsification criterion — the evidence that would disprove it. If a tool ships a fix, that criterion is the test.
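As a rough sketch of how the three levels and the falsification requirement could be represented, assuming a record shape that is not taken from the actual blocks (Confidence and FindingHeader are hypothetical names):

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    EMPIRICAL = "empirical"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class FindingHeader:
    claim: str
    confidence: Confidence
    falsification: str  # the evidence that would disprove the claim

    def __post_init__(self) -> None:
        # Assumed rule: a blank criterion counts as a missing one.
        if not self.falsification.strip():
            raise ValueError("every finding must state a falsification criterion")
```

With this shape, a finding without a falsification criterion cannot be constructed at all, which mirrors the rule that every finding states one.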

Independence

Theory Delta has no vendor relationships, no sponsored content, and no affiliate arrangements. Findings are not shared with tool authors before publication. Self-corrections are published when evidence changes, not buried. The methodology is open; the blocks and findings are versioned in public repositories.

This is a solo project. One person, no editorial board, no committee. That is a constraint and a feature: findings reflect what was actually studied, not what a team decided was worth studying for strategic reasons.

What Theory Delta is not

Not a security scanner. The scan tool is one view on the underlying knowledge — useful for detecting known failure patterns in a specific configuration — but the knowledge is the product, not the scanner.

Not a benchmark. Benchmarks measure performance on a fixed task. Theory Delta documents failure modes: the things that break when you deploy a tool for a purpose its documentation implies it supports.

Not an analyst firm. Traditional analyst coverage evaluates platforms for enterprise procurement. Theory Delta addresses the practitioner question: does this specific tool, in this specific configuration, do what it claims?
