Published: 2026-03-29 · Last verified: 2026-03-29 · Confidence: empirical

Agent framework configs are diverging, not converging — and every major framework has production failure modes the docs don’t warn about

From Theory Delta | Methodology

What the docs say

The leading agent frameworks — LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Pydantic AI, smolagents — present themselves as general-purpose multi-agent orchestration layers supporting major LLM providers and common deployment patterns. Microsoft’s merger of AutoGen and Semantic Kernel is presented as ecosystem consolidation. Framework comparison guides suggest these tools are broadly substitutable for equivalent use cases.

What actually happens

The landscape has fractured. Config patterns are not converging at the orchestration level. Microsoft’s merger is vendor-forced consolidation with a 6-12 month forced migration window — it is not evidence of ecosystem convergence. Every major framework has confirmed production failure modes that are not prominently documented.

The tier structure as of Q1 2026

Tier | Framework | Stars | Status
Graph execution | LangGraph | ~26K | Active, production-grade
Role-based workflow | CrewAI | ~45K | Active, non-deterministic flow
Deprecated/superseded | OpenAI Swarm | ~19K | Archived, replaced by Agents SDK
Microsoft consolidation | AutoGen | ~38K | Maintenance mode since Oct 2025
Emerging (Go-native) | Phero | — | Early stage
Emerging (TypeScript) | VoltAgent | — | Early stage

The Microsoft consolidation is not ecosystem convergence. Microsoft placed AutoGen in maintenance mode in Oct 2025 — bug fixes and security only, no new features. Migration to Microsoft Agent Framework requires “light refactoring” for single agents and a new orchestration model for multi-agent systems. The migration guide recommends migration within 6-12 months. Projects built on AutoGen now carry forced-migration risk.

Framework-level production failure modes

CrewAI: tool fabrication with non-OpenAI models

Issue #3154 (closed not-planned): With non-OpenAI models, agents generate plausible fake Observation output without executing tools. Phoenix traces confirm zero tool activity. The LLM produces the pattern of a tool result without triggering the tool. Two PRs (#3378, #4077) remain open and unmerged.

This invalidates the cross-provider tool reliability assumption. Any CrewAI workflow using non-OpenAI models must independently verify that tool execution is real — not assumed from log output.

LangGraph: no built-in cycle prevention

A documented production case generated 11 revision cycles, burning $4 in API calls before a manual cap was applied. LangGraph provides no automatic cycle prevention; builders must add revision_count < N state counters manually. This is not covered in the main LangGraph documentation.

OpenAI Agents SDK: three active failure categories as of March 2026

  1. Handoff incompatibility with server-managed conversations (Issue #2151, targeted for 0.12.x — unfixed)
  2. Tracing infrastructure unreliable: spans silently dropped in long-running workers (#2135), spans not displaying in dashboard despite successful export (#2477), large integer arguments render incorrectly (#2094)
  3. Dynamic tool loading throws ModelBehaviorError for missing tools (Issue #2646, filed March 10 2026)

Fork-after-thread deadlock (patched Feb 17 2026): The tracing module initializes background threads before gunicorn/uWSGI forks worker processes. With preload=true, the fork captures a partially-initialized threading state, deadlocking the worker. Any deployment on the pre-patch release with preload=true is affected — check version before deploying to WSGI servers.

Streaming guardrails are NOT_PLANNED. The SDK’s guardrail system is architecturally incompatible with streaming responses — the guardrail hook runs after the full response is assembled. Marked NOT_PLANNED by maintainers. Builders who need content filtering on streamed output must implement it outside the SDK, at the transport layer.
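One way to apply transport-layer filtering is to wrap the chunk iterator itself, holding back a small tail buffer so a blocked term split across chunk boundaries is still caught. This is a minimal sketch, independent of the SDK; the blocklist, window size, and `filtered_stream` name are all illustrative assumptions, and a production filter would likely use classifier-based moderation rather than substring matching.

```python
from typing import Iterable, Iterator

def filtered_stream(chunks: Iterable[str], blocked: set[str], window: int = 64) -> Iterator[str]:
    """Yield streamed text chunks, retaining a tail buffer of `window`
    characters so a blocked term split across chunk boundaries is still
    detected. Raises ValueError the moment a blocked term appears."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        low = buf.lower()
        for term in blocked:
            if term in low:
                raise ValueError(f"blocked term in stream: {term!r}")
        # Emit everything except the last `window` chars, which might
        # still be the prefix of a blocked term in the next chunk.
        if len(buf) > window:
            yield buf[:-window]
            buf = buf[-window:]
    # End of stream: the buffer was already scanned; flush the remainder.
    yield buf
```

Because the filter sits between the model stream and the client, a violation can terminate the connection mid-stream instead of after the full response is assembled, which is exactly what the SDK's post-assembly guardrail hook cannot do.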

v0.12.5 fixed MCP retry failures (ClosedResourceError and HTTP 408s) and streamed nested agent output. The three active failure categories above remain open.

smolagents: sandbox bypass (NCC Group PoC)

NCC Group published a working proof-of-concept demonstrating that LocalPythonInterpreter — smolagents’ default sandboxing for local code execution — is bypassable via numpy/pandas import paths. The sandbox blocks direct import os but does not prevent executing shell commands through numpy’s C extension layer.

The sandbox should not be treated as a security boundary for untrusted user inputs. Fully isolated execution (Docker, subprocess with resource limits) is required for any deployment where user-controlled code runs.

Pydantic AI: three independent failure modes

pydantic_evals Python evaluator removed as RCE fix (v1.0 release day): Within 24 hours of shipping, the evaluator was removed as an RCE mitigation. Any builder who shipped against the v1.0 API surface has broken evaluation tooling — the module remains but the evaluator class is gone. The fix requires porting evaluation logic to a non-executing evaluator or a separate sandboxed service.

Parallel MCP tool cancel scope mismatch: When Pydantic AI runs MCP tools concurrently, the async context manager closes before all concurrent calls complete. The result is silent task cancellation — tool calls in-flight at context manager exit are dropped without error. The calling code receives no indication that results are missing. This affects any Pydantic AI + MCP workflow that uses parallelism.

run_stream() and run() divergent tool-handling: Six open issues as of March 2026 track inconsistencies. Tools that execute correctly in run() fail silently or return incomplete results in run_stream(). Any Pydantic AI workflow that mixes streaming and batch tool calls must be tested explicitly in both modes.

Jules (Google): GitHub issue access does not work natively

Practitioner testing confirms Jules explicitly states it cannot access external websites including GitHub. Issue content must be injected into the task prompt manually. The “GitHub integration” covers code access and PR creation — not issue reading or project context retrieval. Any workflow that assumes Jules will pull issue context automatically will proceed without that context, with no error or gap indication.

Mastra: default open access on all tool endpoints

Mastra @mastra/[email protected] ships a pluggable auth system supporting OAuth/SSO and RBAC. The default configuration leaves all tool endpoints open — role configuration is opt-in, not opt-out. Any Mastra deployment that has not explicitly configured roles exposes all tool endpoints without authentication.

n8n self-hosted: no SSRF protection before 2.12.0

n8n’s HTTP Request node had no server-side request forgery protection in self-hosted deployments prior to 2.12.0. Version 2.12.0 introduced configurable SSRF protection. Self-hosted deployments on earlier versions are exposed to SSRF via any workflow that uses HTTP Request nodes with user-controlled URLs. The fix requires upgrading and enabling the protection explicitly — it is not on by default after upgrade.
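For deployments that cannot upgrade immediately, a coarse application-side pre-flight check can reject URLs resolving to non-global addresses before they reach an HTTP Request node. This sketch uses only the Python standard library; it is an illustrative mitigation, not a substitute for n8n's built-in protection, and it does not defend against DNS rebinding unless the connection reuses the checked IP.

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_ssrf_safe(url: str) -> bool:
    """Reject URLs whose host resolves to a private, loopback, link-local,
    or otherwise non-global address. Coarse pre-flight check only."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False  # unresolvable hosts are rejected, not trusted
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if not addr.is_global:
            return False
    return True
```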

Where convergence actually exists

Not at orchestration config level. Framework configurations remain entirely divergent. LangGraph uses TypedDict state schemas with add_messages reducers and LangSmith tracing. CrewAI uses @agent and @task decorated methods with role-based instantiation. AutoGen uses GroupChatManager with speaker_selection_method. These are not interoperable.

At one layer only: LLM selection. Developers building the same workflow in all three frameworks independently chose Claude Sonnet as primary with GPT-4o fallback, verbose logging enabled, and tools attached at instantiation rather than runtime. This is the full extent of observed convergence.

At architecture-composition level: Production teams are building LangGraph as the outer control flow and state management layer, with CrewAI crews or AutoGen group chats as inner nodes. LangGraph handles checkpointing and observability; higher-level frameworks provide role-based abstraction where it simplifies code. This hybrid pattern is a convergence in production architecture, not in framework configs.

What to do instead

For CrewAI with non-OpenAI models: Add an independent verification step that confirms tool execution actually occurred before treating Observation output as valid. Phoenix tracing or similar is the diagnostic tool — log tool call events at the framework level, not just completion events.
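One framework-agnostic way to implement that verification is a ledger that records every real invocation of a wrapped tool function, then cross-checks the record after the run. A minimal sketch; the `ToolCallLedger` name and API are hypothetical, and in a real CrewAI setup the wrapped callables would be the functions you register as tools.

```python
import functools

class ToolCallLedger:
    """Records every real tool invocation so Observation text in agent
    logs can be cross-checked against actual execution."""

    def __init__(self):
        self.calls = []

    def wrap(self, name, fn):
        """Return fn wrapped to log each genuine call."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            self.calls.append((name, args, kwargs))
            return result
        return wrapper

    def verify(self, expected_tools):
        """Raise if any expected tool never actually executed."""
        executed = {name for name, _, _ in self.calls}
        missing = set(expected_tools) - executed
        if missing:
            raise RuntimeError(
                f"tools never executed (possible fabricated Observations): {sorted(missing)}"
            )
```

Calling ledger.verify() after the crew completes turns a fabricated Observation into a hard failure instead of a silently wrong result.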

For LangGraph: Add manual loop counters to all cyclic graphs: revision_count in state, guard condition revision_count < N before nodes that can loop. The framework will not do this for you.
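The counter-and-guard pattern can be sketched without importing LangGraph at all: a node that increments the counter, and a router function of the kind you would pass to add_conditional_edges. Node names, the cap, and the state keys are illustrative assumptions.

```python
MAX_REVISIONS = 3  # hard cap; tune per workflow

def revise_node(state: dict) -> dict:
    """A revision step; increments the counter on every pass."""
    return {**state, "revision_count": state.get("revision_count", 0) + 1}

def route_after_review(state: dict) -> str:
    """Conditional-edge router: loop back to 'revise' only while under
    the cap, otherwise exit. In LangGraph this is the function you would
    wire in with graph.add_conditional_edges('review', ...)."""
    if state.get("needs_revision") and state.get("revision_count", 0) < MAX_REVISIONS:
        return "revise"
    return "finish"
```

The guard lives in the router, not the node, so the cap is enforced even if the model keeps requesting revisions indefinitely.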

For OpenAI Agents SDK: Pin to v0.12.5+ for MCP retry fixes. Avoid preload=true with gunicorn/uWSGI or ensure you are on the post-Feb-17-2026 patch. Do not depend on streaming guardrails — they are NOT_PLANNED. Implement content filtering at the transport layer if needed.
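gunicorn's config file is itself Python, so the preload mitigation is a one-line setting. A minimal sketch, assuming a standard gunicorn.conf.py; the post_fork hook body is illustrative.

```python
# gunicorn.conf.py
# On Agents SDK releases before the Feb 17 2026 patch, preloading the app
# starts the tracing background threads in the master process; the
# subsequent fork captures partially-initialized threading state and the
# worker deadlocks. Loading the app per-worker avoids the fork hazard.
preload_app = False

# If preload is required for memory reasons, gunicorn's post_fork hook is
# the conventional place to (re)initialize anything thread-dependent.
def post_fork(server, worker):
    worker.log.info("worker %s forked; initialize tracing here, not before fork", worker.pid)
```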

For smolagents: Do not use LocalPythonInterpreter as a security boundary for user-controlled code. Use Docker or subprocess isolation with explicit resource limits. The NCC Group PoC demonstrates the sandbox is bypassable via numpy/pandas today.
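A minimal subprocess-isolation sketch using only the standard library (POSIX-only, since it relies on the resource module and preexec_fn). The function name and limits are illustrative; this is process isolation with CPU/memory/wall-clock caps, not a full sandbox, and should be paired with Docker or seccomp for network and filesystem restrictions.

```python
import resource
import subprocess
import sys

def run_untrusted(code: str, timeout_s: float = 5.0, mem_bytes: int = 256 * 2**20) -> str:
    """Execute untrusted Python in a separate process with CPU time,
    address-space, and wall-clock limits applied in the child."""
    def limit():
        # Runs in the child between fork and exec.
        resource.setrlimit(resource.RLIMIT_CPU, (int(timeout_s), int(timeout_s)))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, no user site-packages
        capture_output=True, text=True,
        timeout=timeout_s,          # wall-clock cap, enforced by the parent
        preexec_fn=limit,
    )
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.strip()[:500])
    return proc.stdout
```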

For Pydantic AI: Test streaming and batch tool calls separately — they have divergent code paths with 6 open issues. For any MCP parallel execution workflow, add explicit result validation that all expected results are present (not just that the call returned without error).
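The "all expected results are present" check can be made a reusable wrapper around concurrent execution. A sketch under stated assumptions: `gather_verified` is a hypothetical helper, and in a real Pydantic AI + MCP workflow the awaitables would be the parallel tool calls.

```python
import asyncio

async def gather_verified(named_calls: dict):
    """Run named awaitables concurrently and fail loudly if any result is
    missing or errored -- guarding against silent cancellation of the kind
    seen in the parallel-MCP cancel-scope issue, where in-flight calls are
    dropped without error."""
    names = list(named_calls)
    results = await asyncio.gather(*named_calls.values(), return_exceptions=True)
    out, problems = {}, {}
    for name, res in zip(names, results):
        if isinstance(res, BaseException):
            problems[name] = res
        else:
            out[name] = res
    # Every expected name must map to a real (non-exception) result.
    if problems or set(out) != set(names):
        raise RuntimeError(f"missing or failed tool results: {sorted(set(names) - set(out))}")
    return out
```

The key design choice is return_exceptions=True plus an explicit completeness check, so a dropped or cancelled call surfaces as a named error rather than a silently absent entry.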

For Jules: Pre-load GitHub issue content into the task prompt explicitly. Do not assume GitHub integration covers issue reading.

For AutoGen: Begin migration planning to Microsoft Agent Framework. The 6-12 month window means a Q3/Q4 2026 deadline for any teams on AutoGen today.

For Mastra: Explicitly configure RBAC roles before deploying to any environment beyond local development. Default open access is the ship state.

For n8n self-hosted: Upgrade to 2.12.0+ and explicitly enable SSRF protection. It is not on by default after upgrade.

Environments tested

Tool | Version | Result
CrewAI | Issue #3154 (Q1 2026) | source-reviewed: tool fabrication with non-OpenAI models (#3154 closed not-planned; #3378, #4077 unmerged)
OpenAI Agents SDK | v0.12.5 (March 2026) | source-reviewed: fork deadlock patched Feb 17 2026; streaming guardrails NOT_PLANNED; #2151, #2135, #2646 open
smolagents | current (March 2026) | independently-confirmed: NCC Group PoC sandbox bypass via numpy/pandas
Pydantic AI | v1.0 (Jan 2026) | source-reviewed: pydantic_evals evaluator removed as RCE fix; parallel MCP cancel scope mismatch; run_stream/run divergence
Jules (Google) | current (March 2026) | tested: confirmed no native GitHub issue access; cannot access external websites
n8n self-hosted | < 2.12.0 | source-reviewed: no SSRF protection by default; 2.12.0 release notes
Mastra | @mastra/[email protected] | source-reviewed: default open tool endpoints documented in release notes
AutoGen | maintenance mode (Oct 2025) | source-reviewed: Microsoft migration guide

Confidence and gaps

Confidence: empirical — source-reviewed across 8+ frameworks with issue-level evidence for each failure mode (GitHub Issues, release notes, changelogs verified March 2026). Jules GitHub access claim is tested (practitioner confirmation, not source-reviewed). smolagents sandbox bypass is independently confirmed by NCC Group PoC. SWE-bench retirement independently confirmed by OpenAI’s official retirement announcement.

Falsification criterion: The “diverging, not converging” claim would be disproved by observing a shared configuration standard (schema, SDK, or protocol) adopted by at least three major frameworks (LangGraph, CrewAI, OpenAI Agents SDK) that governs orchestration behavior — not just LLM selection. A shared tool-call validation interface or checkpointing protocol adopted cross-framework would constitute convergence evidence.

ACH lite: Three alternative explanations for the observed fragmentation:

  1. Frameworks are in early-stage competition and convergence will follow market consolidation — possible but not eliminable from current evidence. The Microsoft forced merger is the only consolidation event, and it is corporate, not community-driven. Counter-evidence: five frameworks entered or shipped major releases in Q1 2026 (Phero, VoltAgent, obra/superpowers growth, Haystack v2.15.0, Mastra v1.9.0), suggesting fragmentation is accelerating, not resolving.
  2. The failure modes are known and documented, just not in the main docs — partially true. CrewAI Issue #3154 is public. NCC Group’s PoC is public. But “documented in a GitHub issue” is functionally equivalent to undocumented for most builders. The claim is specifically about what docs do not prominently warn about.
  3. LLM-selection convergence represents the only meaningful convergence dimension — this is the strongest alternative: if framework-level differences don’t matter because the LLM layer is doing the real work, then config divergence is unimportant. Eliminated by the CrewAI tool fabrication case: the failure mode is framework-level, not model-level. Phoenix traces confirm zero tool activity regardless of model capability.

Devil’s advocate: The strongest case against the core claim: obra/superpowers’ 2x star growth in one month (56.5K → 118.9K) may signal ecosystem consolidation around a specific architectural pattern (two-stage review), not fragmentation. If obra/superpowers becomes a de facto standard, convergence could be happening at the architectural pattern level even if config syntax diverges. Counter: star velocity is not adoption; the two-stage review pattern is not yet implemented across other frameworks.

Open questions: (1) Does the CrewAI tool fabrication bug affect all non-OpenAI providers or only specific ones? Which providers are confirmed unaffected? (2) Is there a published security audit of any framework-bundled Python code execution sandbox that passed? (3) Will the obra/superpowers two-stage review pattern propagate to LangGraph or CrewAI as a built-in feature?

Unverified claims in source block: Vertical stack crystallization (sales GTM, SWE, debugging, forecasting as distinct clusters) is marked unverified in the source. MiroFish 1M agent forecasting and Manus AI Meta acquisition figures are secondary-research/web-research and have not been independently verified against primary sources for this publication.

Seen different? Contribute your evidence — theory delta is what makes this knowledge base work.
