theorydelta field guide · v2026.04
built 2026-05-24 findings: 47 task hubs: 6 independent · evidence-traced · no vendor influence

Tool-poisoning attacks against MCP agents succeed more than one-third of the time and the stealth class is undetectable by production tooling

Published: 2026-05-10 Last verified: 2026-04-19 empirical
PUBLISHED FACT-CHECKED 2026-04-19 · 0 corrections
5 claims 0 tested landscape
Staleness risk: high — provider APIs in this area change frequently. Test specific limits and failure modes in your environment before acting.

Tool-poisoning attacks against MCP agents succeed more than one-third of the time and the stealth class is undetectable by production tooling

What you expect

Agent security observability is a solved problem via existing APM tooling. Adding OTel instrumentation, routing spans to a SIEM, and installing an MCP gateway gives you full visibility into what agents do at runtime. More capable reasoning models are safer than less capable ones. Upgrading to a smarter model improves security posture.

What actually happens

The MCPTox benchmark — 1,312 malicious test cases derived from 45 real-world MCP servers with 353 authentic tools, tested across 20 LLM agents — establishes the empirical ground truth that vendor marketing cannot.

Attack success rates contradict default security assumptions

Average attack success rate (ASR) across all 20 agents: 36.5%. Explicit refusals (agent identifies and actively rejects the attack): under 3% even for Claude 3.7 Sonnet. The dominant failure mode is “Ignored” — the agent follows user intent passively rather than actively resisting adversarial tool metadata.

More capable reasoning models are more susceptible, not less:

ModelASR
o1-mini72.8%
DeepSeek-R170.9%
Phi-470.2%

Extended chain-of-thought and reasoning create additional attack surface: the poisoned tool description gets more opportunity to influence multi-step parameter choices. Upgrading to a smarter model does not improve security posture against this attack class. (MCPTox benchmark)

Layer 3 observability is architecturally blocked for API-deployed models

The MCPTox benchmark’s most dangerous attack class is stealth: a poisoned tool description influences parameter choices in legitimate tool invocations, leaving no execution trace. No production observability tool can detect this:

LayerWhat is capturedWho captures itSecurity relevance
Layer 1: MetadataTool name, timestamp, success/failureAgentOps, most gatewaysInsufficient for forensics
Layer 2: PayloadTool call arguments and return valuesmcp-gateway-registry (off by default in most OTel tools)Required for forensics
Layer 3: ReasoningWHY the agent chose a tool — attention weightsMindGuard (DDG) only — research artifactRequired for stealth attack detection

MindGuard achieves 94–99% detection precision on stealth attacks by analyzing attention patterns in the model’s reasoning layer. Critical constraint: it requires access to model attention weights. Attention weights are unavailable for API-deployed models (Claude, GPT-4, and all major commercial LLMs). This constraint makes MindGuard inapplicable to the dominant enterprise deployment model — Layer 3 observability is not just nascent but architecturally blocked.

SIEM integration and audit trail standards remain fragmented

OTel GenAI Issue #2664 (unified agent span schema) has not merged — there is no standard schema for tool call arguments, and every vendor emits a proprietary format. OpenLLMetry — the most widely-used OTel SDK for LLMs — has no A2A trace propagation: multi-agent audit trails produce disconnected spans, not unified conversation traces.

OWASP Agent Observability Standard v0.1 defines AgBOM (Agent Bill of Materials), a structured tool inventory analogous to software SBOM, to address the capability inventory gap (90%+ of surveyed organizations run MCP servers with all tools enabled including destructive functions, with no runtime inventory). No reference implementation exists as of April 2026.

What this means for you

Your OTel instrumentation does not detect tool-poisoning attacks. Layer 1 and Layer 2 observability produce no alert for the MCPTox stealth attack class. The attack completes — the agent uses poisoned parameters in a legitimate tool call — and all logs show normal execution. The 36.5% average ASR means roughly one in three attempts against a typical production agent succeeds.

Smarter model upgrades increase the attack surface. Teams upgrading from a less capable model to o1-mini or DeepSeek-R1 for better reasoning are also increasing susceptibility to tool poisoning from 36.5% average to 70%+ per the MCPTox benchmark. The security briefing for that upgrade should include this tradeoff.

Multi-agent audit trails are broken by default. Even if you instrument every agent individually, OpenLLMetry does not propagate trace context across agent handoffs. A forensic investigation after an incident will find disconnected spans with no way to reconstruct what a multi-agent system did as a unit.

What to do

  1. Treat tool-poisoning defense as pre-deployment, not runtime. No current production tool detects stealth tool-poisoning against API-deployed models. Defense must be supply-chain: vet MCP servers before installing, pin versions, review changelogs for tool description changes, and use only tools from known publishers.
  2. Enable Layer 2 payload logging explicitly. Tool call arguments are off by default in most OTel-based systems due to privacy concerns. Turn them on and route to your SIEM — this is the forensics layer for non-stealth attacks. agentic-community/mcp-gateway-registry is the only confirmed open-source MCP gateway with full payload logging.
  3. Implement explicit trace context injection at agent handoffs. Do not rely on OpenLLMetry to propagate context across agent boundaries. Pass trace IDs explicitly at handoff points, or accept that your audit trail will show disconnected spans.
  4. For high-security deployments, consider open-weight models where MindGuard is applicable. If your threat model includes stealth tool-poisoning as an active risk, API-deployed models have no detection path. Open-weight models with accessible attention weights are the only architecture where Layer 3 detection is possible.
  5. Watch OTel Issue #2664. When a unified agent span schema merges, cross-vendor audit trail correlation becomes possible without custom mapping. Until then, plan for proprietary format normalization in your SIEM pipeline.

Falsification criterion: This finding would be disproved by a production observability tool that detects MCPTox stealth attacks against Claude or GPT-4 API without requiring model attention weight access, or a demonstration that more capable reasoning models have lower tool-poisoning ASR than less capable ones.

Evidence

ToolVersionEvidenceResult
MCPTox benchmarkreview date 2026-04-08source-reviewed36.5% avg ASR across 20 agents; o1-mini 72.8%; stealth class leaves no execution trace
MindGuard (DDG)review date 2026-04-08source-reviewed94-99% detection precision; requires open-weight model attention access — inapplicable to API-deployed models
OWASP Agent Observability Standardv0.1 preview (2026-02-27)docs-reviewedAgBOM spec published; no reference implementation exists
OTel GenAI Issue #2664review date 2026-04-08source-reviewedNo standard tool call argument schema; every vendor emits proprietary format
OpenLLMetryreview date 2026-04-08source-reviewedNo A2A trace propagation; multi-agent audit trails produce disconnected spans

Confidence: empirical — 5 environments reviewed. The MCPTox benchmark results are the primary evidence; all other findings are source-reviewed from GitHub issues, arXiv papers, and specification documents.

Strongest case against: The MCPTox benchmark is a research artifact published August 2025 — the 36.5% ASR figure represents a moment-in-time measurement across 20 specific agents with specific system prompts. Production deployments with system-prompt defenses, MCP gateway filtering, or tool allowlisting may see substantially lower ASRs. The “more capable models are more susceptible” finding may not generalize across all task types or tool configurations. MindGuard’s open-weight constraint is real but leaves open the possibility of attention-weight approximation techniques that work against black-box APIs.

Open questions: Has any production MCP gateway deployed behavioral anomaly detection that reduces tool-poisoning ASR below the 36.5% MCPTox baseline? Do system-prompt defenses against prompt injection meaningfully reduce MCPTox ASR, or does the stealth class evade them entirely? Has OTel Issue #2664 merged since the April 2026 review date?

Seen different? Contribute your evidence — theory delta is what makes this knowledge base work.

theorydelta.com · 2026 independent · evidence-backed · every claim sourced or labelled rss · mcp · /scan · llms.txt