Findings

What agentic tools actually do, not what their docs claim. Each finding is backed by tested claims and linked evidence.

Agent Frameworks Diverging Not Converging

Mar 29, 2026 empirical

The multi-agent framework landscape is not converging — it has fractured into tiers with distinct maturity profiles. The one apparent consolidation (Microsoft merging AutoGen + Semantic Kernel) is vendor-forced, not ecosystem-driven. Every major framework has confirmed production failure modes that docs do not prominently warn about.

Agent Security Landscape Four Tiers

Mar 29, 2026 medium

The agent security landscape is described as a single category in most ecosystem maps. In practice it has split into four architecturally distinct tiers with no tool spanning more than one.

Agentic Error Suppression

Mar 29, 2026 medium

Common assumption: error suppression patterns like 2>/dev/null and || true are a minor code smell. Actual finding: these patterns are structurally incompatible with agentic execution because agents have no alternative feedback channel. A human sees the terminal, notices unexpected behavior, and investigates; an agent sees exit code 0 and proceeds. The Tools Fail paper (arXiv:2406.19228) measured the delta: GPT-3.5 accuracy dropped from 98.7% to 22.7% (a 76-percentage-point drop) when tool errors were silent rather than explicit.
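A minimal sketch of that feedback gap, assuming a POSIX shell is available. The path and the `run` helper are hypothetical, chosen only for illustration; the point is what a tool wrapper gets back in each case.

```python
import subprocess

# Hypothetical path, chosen only because it should not exist on any machine.
MISSING = "/nonexistent-path-for-demo"

def run(cmd: str) -> subprocess.CompletedProcess:
    """Run a shell command the way a simple agent tool wrapper might."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)

# Suppressed: stderr is discarded and the exit status is forced to 0.
suppressed = run(f"ls {MISSING} 2>/dev/null || true")

# Explicit: the same failure, left visible to the caller.
explicit = run(f"ls {MISSING}")

print("suppressed:", suppressed.returncode, repr(suppressed.stderr))
print("explicit:  ", explicit.returncode, repr(explicit.stderr))
```

In the suppressed case the wrapper receives exit code 0 with empty stdout and stderr, which is indistinguishable from a successful listing of an empty directory. In the explicit case the non-zero exit code and the stderr message both reach the caller.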

ChromaDB RAM Ceiling Data Loss

Mar 29, 2026 medium

Documentation presents LRU cache as automatic memory management, but real-world users report memory exceeding configured limits before eviction. Memory is not released after collection deletion, and connection leaks cause container memory to grow monotonically until OOM. The broader theory delta: ChromaDB is presented as a production vector database but its single-node RAM-bound architecture makes it a prototyping tool with a hard scaling ceiling.

LangGraph Checkpoint Serialization Silent Loss

Mar 29, 2026 empirical

GraphRAG v3 (Jan 2026) trades NetworkX for a DataFrame-based pipeline and ships a performance regression vs v2 (issue #2250). LangGraph checkpoint serialization has failed silently in four documented modes since Jan 2026: checkpoint round-trips are lossy for non-primitive types, and no exception is raised.
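To illustrate what a lossy round-trip with no exception looks like, here is a sketch using Python's standard `json` module. This is an assumption-laden stand-in: it is not LangGraph's serializer and does not reproduce any of the four documented modes. The `state` dictionary is hypothetical.

```python
import json

# Hypothetical agent state containing non-primitive values:
# a tuple and a dict keyed by integers.
state = {"span": (3, 7), "retries": {1: "timeout", 2: "rate_limit"}}

# Round-trip through JSON. No exception is raised at either step.
restored = json.loads(json.dumps(state))

print(restored)           # the tuple is now a list, the int keys are now strings
print(restored == state)  # the round-trip did not preserve the state
```

A cheap guard against this class of bug is to compare the restored state with the original in a test before trusting any checkpoint format.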

SWE-bench Retired Benchmark Gap

Mar 29, 2026 secondary-research

Benchmark scores overstate production reliability by 20-30 percentage points due to benchmark contamination, single-pass measurement, and flawed test cases — SWE-bench Verified was retired in Feb 2026 after 59.4% of its test cases were found to be flawed.

Goose Default Config Security

Mar 17, 2026 validated

Goose's defaults -- autonomous mode, no extension allowlist, disabled injection detection, 1000-turn ceiling -- each removes a guardrail that would contain the others.

Claude Code Settings Attack Surface

Mar 14, 2026 empirical

Two CVEs and five supply-chain vectors share one pattern: project-scoped settings execute with user privileges before trust verification.

Claude Code Hooks Unreliable Enforcement

Mar 3, 2026 empirical

Five categories of hook failures -- silent non-firing, ignored decisions, platform breakage, data corruption, and architectural constraints -- mean defense-in-depth across multiple events is required.

LLM Gateway Silent Failures

Mar 1, 2026 empirical

LiteLLM gateway features fail silently under production conditions -- budget counters drift, guardrails pass, cache misses go unreported, fallbacks route to dead providers. Every claim traced to a public GitHub issue.

Agentic RAG Three Silent Failures

Feb 27, 2026 empirical

GraphRAG's entity deduplication has a fatal bug — entities with identical names but different types are merged, corrupting multi-hop reasoning. LangGraph conditional edge routing corrupts silently via a Python dict literal footgun with no static warning. Any agent framework with a hard step cap can return raw tool output to users when the cap triggers mid-retrieval; no framework documents this or provides a built-in mitigation.
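The entity-merging failure can be sketched in a few lines. This is not GraphRAG's code; the entities, the `merge` helper, and the key functions are all hypothetical, and the sketch only shows how deduplicating on name alone collapses entities that a (name, type) key would keep apart.

```python
from collections import defaultdict

# Hypothetical extracted entities: (name, type, fact).
entities = [
    ("Mercury", "planet", "orbits the Sun"),
    ("Mercury", "element", "is liquid at room temperature"),
]

def merge(entities, key):
    """Group facts under whatever key the deduplication step uses."""
    merged = defaultdict(list)
    for name, etype, fact in entities:
        merged[key(name, etype)].append(fact)
    return dict(merged)

# Name-only key: both entities collapse into one node.
by_name = merge(entities, key=lambda name, etype: name)

# (name, type) key: the two entities stay distinct.
by_name_and_type = merge(entities, key=lambda name, etype: (name, etype))

print(by_name)
print(by_name_and_type)
```

With the name-only key, a multi-hop query that reaches "Mercury" can pick up facts about either entity, which is the corruption the finding describes.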

DeepEval Exfiltrates Traces Via OTel Hijack

Feb 27, 2026 medium

Langfuse markets 'instrument once, trace everything'. In practice, non-generation spans (routing steps, tool decisions, agent handoffs) require manual set_attribute() calls for input/output visibility; users running LangGraph supervisor orchestration report empty values on agent input spans. DeepEval hijacks the global OTel TracerProvider on import, exfiltrating trace data to Traceloop cloud regardless of the configured backend.

Graph Memory Self Hosted Not Production Ready

Feb 27, 2026 empirical

The claim 'no graph memory tool is production-ready' was correct for self-hosted deployments but requires precision: Graphiti self-hosted has a critical async event loop conflict not documented in the README; Mem0 OSS graph features are OpenAI-only despite 47k stars implying general availability; the official MCP memory server corrupts files under concurrent access.

Agent Testing Non-Deterministic CI

Feb 25, 2026 independently-confirmed

All four eval frameworks reviewed that use LLM-as-judge for CI gating produce non-deterministic pass/fail results -- the grading layer is non-deterministic by design, not just the model under test. No deterministic replay tool was found for MCP/tool-calling agents (searched GitHub/npm/PyPI Feb 2026). VCR-style recording works at the HTTP layer but does not intercept MCP tool dispatch.

MCP Supply Chain Security Institutionally Confirmed

Feb 25, 2026 independently-confirmed

MCP supply chain security was classified as 'emerging' -- two enterprise acquisitions in 90 days (Snyk/Invariant Labs June 2025, Docker/MCP-Defender September 2025) change the classification to institutionally confirmed. The rug-pull gap is partially closed by mcp-scan tool pinning, but mid-session mutation remains open.

MCP Database Servers Security Bypass

Feb 24, 2026 independently-confirmed

Connecting an agent to a database is framed as a correctness problem. It is primarily a security problem -- every MCP database server reviewed uses startsWith('select') as its read-only guard, which is bypassable.
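A sketch of why a prefix check is not a read-only guarantee. The guard below is a Python rendering of the startsWith('select') check; the whitespace trimming and lowercasing are assumptions about how a server might normalize input, and the example statements are hypothetical. Whether a stacked statement actually executes depends on the database driver, so treat that case as illustrative.

```python
def naive_read_only_guard(sql: str) -> bool:
    """Python rendering of a startsWith('select') read-only check.

    The strip/lower normalization is an assumption, not taken from any server.
    """
    return sql.strip().lower().startswith("select")

# Hypothetical statements that begin with SELECT but are not read-only.
bypasses = [
    # Stacked statement (runs only if the driver allows multiple statements).
    "SELECT 1; DROP TABLE users",
    # SELECT ... INTO creates a new table in PostgreSQL and SQL Server.
    "SELECT * INTO users_copy FROM users",
    # PostgreSQL function with side effects, called from a SELECT.
    "SELECT pg_terminate_backend(pid) FROM pg_stat_activity",
]

for sql in bypasses:
    print(naive_read_only_guard(sql), sql)

# A genuinely read-only query that the guard rejects.
cte = "WITH t AS (SELECT 1 AS n) SELECT n FROM t"
print(naive_read_only_guard(cte), cte)
```

The guard accepts every statement in `bypasses` and rejects the harmless CTE query, so it is wrong in both directions. Enforcing read-only access at the database level, for example with a role that has only SELECT privileges or a read-only transaction, does not depend on parsing the SQL text at all.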

MCP Stateless HTTP Silent Feature Loss

Feb 24, 2026 independently-confirmed

Stateless HTTP mode silently disables sampling and elicitation -- a protocol-level constraint, not a library bug. The current stateless pattern is explicitly a workaround; a spec redesign is targeted for June 2026.