DeepEval Exfiltrates Trace Data on Import and Langfuse Silently Drops Input/Output on Non-Generation Spans

From Theory Delta | Published 2026-02-27

What the docs say

DeepEval is documented as an evaluation framework for LLM applications. It is commonly used in CI pipelines to grade agent outputs. Langfuse markets "instrument once, trace everything" with auto-instrumentation for LangChain, LangGraph, and the OpenAI Agents SDK. Both are positioned as observability tools for agent systems.

What actually happens

DeepEval hijacks the global OTel TracerProvider on import. GitHub issue #2497 documents that when deepeval is imported, it registers an exporter that sends trace data to Traceloop's cloud endpoints. This happens regardless of what OTel backend your application has configured. Any application that imports DeepEval in an environment with production trace data — including CI pipelines that run against production databases, staging environments with real user queries, or shared test environments — is at risk of sending that trace data to an external cloud service without explicit opt-in.

This is not a configuration issue. The exfiltration happens at import time. The only safe deployment pattern is to run DeepEval in isolated test environments with no production trace data, and to treat any CI pipeline that imports DeepEval as a data boundary.
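One way to check this in your own environment is to list the exporters attached to the global tracer provider before and after the import. The following is a minimal sketch, not a documented DeepEval or OpenTelemetry API: `exporter_names` and `exporters_added_by_import` are hypothetical helper names, and the private attributes `_active_span_processor`, `_span_processors`, and `_exporter` are assumptions about the OpenTelemetry Python SDK's internals that may differ between versions.

```python
import importlib


def exporter_names(provider) -> list[str]:
    """Best-effort list of exporter class names attached to a tracer provider.

    Assumes the OpenTelemetry Python SDK layout, where the provider keeps its
    span processors in the private attributes read below. Returns [] for any
    provider that does not match that layout.
    """
    multi = getattr(provider, "_active_span_processor", None)
    processors = getattr(multi, "_span_processors", ()) or ()
    names = []
    for processor in processors:
        # The exporter attribute name varies by processor type and SDK version.
        exporter = getattr(processor, "span_exporter", None) or getattr(
            processor, "_exporter", None
        )
        # Fall back to the processor itself if no exporter attribute is found.
        target = exporter if exporter is not None else processor
        cls = type(target)
        names.append(f"{cls.__module__}.{cls.__name__}")
    return names


def exporters_added_by_import(module_name: str) -> list[str]:
    """Import a module and report exporter classes that appeared afterwards.

    Compares by class name only, and has no effect if the module was already
    imported earlier in the process.
    """
    from opentelemetry import trace  # requires the opentelemetry-api package

    before = exporter_names(trace.get_tracer_provider())
    importlib.import_module(module_name)
    after = exporter_names(trace.get_tracer_provider())
    return [name for name in after if name not in before]
```

Calling `exporters_added_by_import("deepeval")` in a fresh interpreter, after your own tracing setup has run, should return an empty list if the import leaves the provider alone. Any entry you did not configure yourself is worth investigating.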

Langfuse's "instrument once, trace everything" claim breaks at non-generation spans. In production LangGraph supervisor orchestration, non-generation spans — routing steps, tool decisions, agent handoffs — show empty input and output values unless set_attribute() is called manually. The auto-instrumentation covers direct LLM API calls. It does not automatically capture the inputs and outputs of the orchestration logic between those calls. Users discover this when their traces show the LLM outputs but not the routing decisions that led to them.

The fix is to set input and output explicitly on non-generation spans:

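# Langfuse Python SDK v3: observe and get_client are both top-level imports.
from langfuse import get_client, observe

@observe(name="routing_step")
def route_to_agent(state: dict) -> str:
    # Routing decision. The "next" key is an assumed field in the supervisor
    # state; substitute whatever your graph uses to pick the next node.
    next_node = state["next"]
    # Record input and output explicitly so the span is not empty in the trace.
    get_client().update_current_span(
        input=state,
        output=next_node,
    )
    return next_node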

This requirement is not documented for LangGraph supervisor patterns, so it surfaces as an operational gap rather than as a known limitation.

No observability platform prevents cost runaway mid-run. All reviewed platforms detect token and cost overruns post-hoc in dashboards. None implement real-time blocking at execution time. If an agent enters an infinite retrieval loop or a subagent spawning cascade, the overage happens before it is visible in any dashboard. The only mitigation is application-level: wrap API calls with a token counter and raise an exception at threshold before dispatching to the provider.
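A minimal sketch of such a guard follows. `TokenBudget`, `BudgetExceeded`, and `guarded_call` are hypothetical names, not part of any provider SDK. The sketch assumes the caller can supply a token estimate before dispatch and that the wrapped callable returns the actual token usage alongside its result.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push the run past its token budget."""


class TokenBudget:
    """Running token counter that blocks calls before they are dispatched."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def guarded_call(self, call, estimated_tokens: int):
        """Run call() only if the estimate still fits within the budget.

        call must return a (result, actual_tokens) tuple so the counter can
        be updated with real usage after the provider responds.
        """
        if self.used + estimated_tokens > self.max_tokens:
            # Refuse before dispatching, so no cost is incurred for this call.
            raise BudgetExceeded(
                f"{self.used} used + {estimated_tokens} estimated "
                f"> {self.max_tokens} allowed"
            )
        result, actual_tokens = call()
        self.used += actual_tokens
        return result
```

In practice the callable would wrap the provider SDK call and read the token usage from its response. Sharing one `TokenBudget` instance across every call in a run is what lets the guard stop a loop or spawning cascade partway through.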

Multi-agent tracing across independent processes is structurally unsolved. All reviewed platforms trace multi-agent runs within a single process. Cross-process agent calls (separate containers, separate workers) require manually propagating a trace ID and injecting it into the child agent's context. The one partial exception is LangWatch, which supports HTTP-based trace propagation for cross-process spans but requires both sides to use LangWatch instrumentation.
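Manual propagation can be sketched as follows. The carrier here is the W3C Trace Context `traceparent` header format, which is an assumption: the reviewed platforms may expect a different field or format. The helper names and the payload keys are hypothetical.

```python
import re
import secrets

# traceparent layout: version "00", 32-hex trace ID, 16-hex span ID, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Build a traceparent value for a new, sampled trace in the parent agent."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def parse_traceparent(header: str) -> tuple[str, str, bool]:
    """Return (trace_id, parent_span_id, sampled) from a traceparent value."""
    match = _TRACEPARENT.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C Trace Context format.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        raise ValueError("traceparent IDs must not be all zeros")
    sampled = bool(int(flags, 16) & 0x01)
    return trace_id, parent_span_id, sampled


def build_child_payload(task: str, traceparent: str) -> dict:
    """Attach the trace context to the payload sent to the child agent."""
    return {"task": task, "traceparent": traceparent}
```

The parent agent calls `build_child_payload` before dispatching work. The child process calls `parse_traceparent` on the received field and starts its own spans under the returned trace ID and parent span ID, using whatever tracing SDK it runs.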

What to do instead

For any CI pipeline that uses DeepEval: Isolate it in a test environment with no production trace data. Do not import DeepEval in the same process as production telemetry. Audit existing CI pipelines for unintended data flow between test execution and production tracing backends.

For Langfuse with LangGraph supervisor orchestration: Set input and output explicitly on non-generation spans, either with set_attribute() on the span or through the Langfuse SDK's update call for the current span. Treat the auto-instrumentation as covering LLM API calls only. Budget manual instrumentation time for routing and handoff steps.

For MCP-native tracing: LangWatch is the only reviewed platform with explicit mcp_server and mcp_tool_name span fields. Agents self-report via tool calls without SDK wrapping. W&B Weave adds MCP trace logging with a single @weave.op decorator (cloud-only; not viable for air-gapped environments).

For cross-process multi-agent tracing: Propagate a trace ID explicitly in the agent call payload and inject it into the child agent's context at instantiation. LangWatch's HTTP-based propagation is the closest to automated; all other platforms require manual propagation.

For cost runaway prevention: Implement the limit in application code, at the point where the provider SDK is called. Wrap API calls (Anthropic's, for example) with a running token counter and raise a budget exception before dispatching once the counter exceeds the threshold. Do not rely on dashboard alerts; they arrive after the cost has been incurred.

Environments tested

Tool	Version (or snapshot date)	Result
confident-ai/deepeval	Feb 2026	OTel TracerProvider hijacked on import; trace data sent to Traceloop cloud (Issue #2497)
langfuse/langfuse	3.x	Non-generation spans show empty input/output in LangGraph supervisor without manual `set_attribute()`
langwatch/langwatch	Feb 2026	`mcp_server` / `mcp_tool_name` span fields capture MCP tool provenance natively; HTTP trace propagation enables cross-process tracing

Confidence and gaps

Confidence: medium — confirmed via GitHub issue with community reproduction for DeepEval; Langfuse gap reported by multiple LangGraph supervisor users; LangWatch MCP fields confirmed via documentation review.

Open questions: Has DeepEval shipped a fix to Issue #2497 that makes the Traceloop exporter opt-in? Does the Langfuse empty-span issue affect all non-generation span types or only LangGraph supervisor-specific handoff patterns? Is LangWatch's cross-process HTTP propagation reliable under high concurrency?

This claim would be disproved by observing: A DeepEval release that does not register a Traceloop exporter on import by default, confirmed by inspecting the global OTel TracerProvider after import and finding no Traceloop endpoint registered. Or a Langfuse release that explicitly documents which span types require manual set_attribute() for input/output visibility and ships auto-instrumentation for LangGraph supervisor handoff spans.
