From Theory Delta | Methodology | Published 2026-02-25
Agent eval frameworks (deepeval, promptfoo, awslabs/agent-evaluation) advertise CI integration as a core feature. Set up a test suite, run it in your pipeline, gate deployments on passing evals. For deterministic replay of external calls, VCR-style libraries (vcrpy, vcr-langchain) record HTTP interactions as cassettes and replay them in CI, eliminating flakiness.
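The cassette model those libraries implement can be sketched in a few lines: the first run records each request/response pair to a file; later runs replay from it, making CI deterministic. This is a stdlib-only illustration of the concept, not vcrpy's API, and `live_fetch` stands in for a real HTTP call.

```python
import json
import pathlib
import tempfile

def with_cassette(path, fetch):
    """Wrap fetch(url) so responses are recorded to `path` on first
    use and replayed from it afterwards (the VCR 'cassette' pattern)."""
    cassette = pathlib.Path(path)
    cache = json.loads(cassette.read_text()) if cassette.exists() else {}

    def wrapped(url):
        if url not in cache:              # record mode: call through, persist
            cache[url] = fetch(url)
            cassette.write_text(json.dumps(cache))
        return cache[url]                 # replay mode: no network, no flake
    return wrapped

calls = []
def live_fetch(url):
    calls.append(url)                     # stands in for a real HTTP request
    return {"status": 200, "body": "payload for " + url}

path = pathlib.Path(tempfile.mkdtemp()) / "cassette.json"
fetch = with_cassette(path, live_fetch)
first = fetch("https://api.example.com/search")
second = fetch("https://api.example.com/search")  # served from the cassette
```

The second call never touches `live_fetch` — that is the property CI relies on, and it is exactly what has no equivalent for MCP tool dispatch.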
Two structural problems make agent CI gates unreliable:
Problem 1: The grading layer is non-deterministic. Every production eval framework uses LLM-as-judge to score agent outputs. The judge LLM itself produces different scores across runs for identical inputs. This is not a bug -- it is inherent to the architecture. A CI gate built on LLM-as-judge will produce different pass/fail results for the same code on consecutive runs.
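The structural issue is easy to make concrete with a simulation: model the judge as a scorer whose output jitters slightly around a true quality near the pass threshold, and the gate flips between pass and fail on identical input. The jitter magnitude and threshold below are illustrative, not measured from any real judge.

```python
import random

THRESHOLD = 0.7

def judge(agent_output: str, run_seed: int) -> float:
    """Stand-in for an LLM judge: true quality 0.69 plus small
    run-to-run jitter, mimicking residual variance at temperature=0."""
    rng = random.Random(run_seed)
    return 0.69 + rng.uniform(-0.05, 0.05)

def ci_gate(agent_output: str, run_seed: int) -> bool:
    return judge(agent_output, run_seed) >= THRESHOLD

# Same code, same input, 100 "CI runs": the gate contradicts itself.
results = [ci_gate("identical agent output", run_seed=i) for i in range(100)]
assert True in results and False in results
```

Any output whose true quality sits near the threshold inherits this behavior; only outputs far from the threshold gate stably.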
Mitigations observed in the wild, none of which eliminate the problem:
temperature=0 and a fixed seed reduce but do not eliminate variation -- both OpenAI and Anthropic acknowledge this in their docs. awslabs/agent-evaluation acknowledges non-deterministic outcomes in its own documentation. This is the honest position, but it means any team using agent evals as a hard CI gate is running a probabilistic gate, not a deterministic one.
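Another mitigation teams layer on top is repeated judging with aggregation: score the same output k times and gate on the median (or majority) verdict. This shrinks the flip rate near the threshold but, like seed fixing, does not eliminate it. The sketch below is a generic pattern, not any framework's API; the lambdas stand in for judge calls.

```python
import statistics

def aggregated_gate(judge, agent_output, k=5, threshold=0.7):
    """Score the same output k times and gate on the median score.
    Reduces, but does not remove, run-to-run flips near the threshold."""
    scores = [judge(agent_output) for _ in range(k)]
    return statistics.median(scores) >= threshold

# Deterministic stand-in judges, for illustration only.
assert aggregated_gate(lambda out: 0.9, "output") is True
assert aggregated_gate(lambda out: 0.5, "output") is False
```

The cost is k judge calls per test case, which is why most teams observed in the wild run k=1 and absorb the flakiness.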
Problem 2: No MCP tool replay exists. VCR-style recording intercepts HTTP calls. MCP agents using stdio or SSE transports do not make HTTP calls for tool dispatch -- the communication happens over standard input/output or server-sent events. No library intercepts these transports.
What builders are doing instead: writing one-off fake MCP servers that return scripted JSON-RPC responses (not shared), testing at the integration level with real MCP servers and real tool responses, or skipping unit-level MCP tool testing entirely.
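A one-off fake along those lines is small: a loop that reads newline-delimited JSON-RPC requests from stdin and answers `tools/call` with canned results. This is a minimal sketch of the pattern builders describe, not a shared library -- it handles only the happy path, and the response shape here follows the MCP `tools/call` result format as published, so verify it against the spec version you target.

```python
import json
import sys

# Canned results keyed by tool name -- the "cassette" for tool dispatch.
SCRIPTED = {
    "get_weather": {"content": [{"type": "text", "text": "72F, clear"}]},
}

def handle(request: dict) -> dict:
    """Answer a JSON-RPC tools/call request with a scripted result."""
    if request.get("method") == "tools/call":
        tool = request["params"]["name"]
        result = SCRIPTED.get(tool, {"content": []})
        return {"jsonrpc": "2.0", "id": request["id"], "result": result}
    return {"jsonrpc": "2.0", "id": request.get("id"),
            "error": {"code": -32601, "message": "method not found"}}

def serve(stdin=sys.stdin, stdout=sys.stdout):
    """Newline-delimited JSON-RPC loop over stdio."""
    for line in stdin:
        stdout.write(json.dumps(handle(json.loads(line))) + "\n")
        stdout.flush()
```

Point the agent under test at this process as its MCP server and tool responses become deterministic -- but every team rebuilds this from scratch, which is the gap the table below documents.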
Compounding failure: deepeval itself has bugs in its eval logic. The is_successful field silently returned the wrong success status in a happy-path case -- the eval framework reported tests as passing when they were failing. This was fixed reactively after community reports, but the precedent stands: the eval framework itself can have silent correctness failures, stacking a correctness problem on top of the non-determinism problem. G-Eval with OpenAI o4-mini also produced 403 errors due to missing logprobs support, requiring a special-case fallback patch.
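Given that precedent, a cheap defense is to never trust a framework's derived pass/fail flag: recompute the gate from the raw score and threshold yourself, and fail loudly on disagreement. This is a generic sketch, not deepeval's API -- `score`, `threshold`, and `framework_pass` are whatever your framework exposes.

```python
def gate(score: float, threshold: float, framework_pass: bool) -> bool:
    """Recompute pass/fail from the raw score and raise if the
    framework's own flag disagrees -- catches silent false-pass bugs."""
    ours = score >= threshold
    if ours != framework_pass:
        raise RuntimeError(
            f"framework flag {framework_pass} contradicts score "
            f"{score} vs threshold {threshold}")
    return ours

assert gate(0.9, 0.7, framework_pass=True) is True
```

A disagreement does not tell you which side is wrong, but it converts a silent false pass into a hard CI failure you will actually investigate.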
| Tool | Version | Result |
|---|---|---|
| confident-ai/deepeval | latest (Feb 2026) | independently-confirmed: G-Eval is_successful silent false-pass bug confirmed by community reports |
| promptfoo/promptfoo | latest (Feb 2026) | source-reviewed: MCP security red-team plugin confirmed; no functional MCP test support |
| laude-institute/harbor | latest (Feb 2026) | docs-reviewed: ATIF trajectory format reviewed; eval+RL unified |
| awslabs/agent-evaluation | latest (Feb 2026) | independently-confirmed: non-deterministic outcomes acknowledged in own docs |
| LangChain FakeChatModel | latest (Feb 2026) | source-reviewed: does not expose prompt inputs without subclassing |
| amosjyng/vcr-langchain | v0.1.x (stale since Jan 2024) | source-reviewed: HTTP-only recording; no MCP tool dispatch interception |
Confidence: source-reviewed + independently-confirmed -- source code and documentation reviewed across 6 tools and frameworks. No eval frameworks were executed in CI pipelines; non-determinism is confirmed by design analysis and third-party acknowledgment (awslabs self-documents it, deepeval bug confirmed by community). Note: scope_matches=false because the claim "every agent eval framework" was assessed by reviewing 4 frameworks (deepeval, promptfoo, harbor, awslabs), not an exhaustive survey.
Unlinked claims: (1) "No MCP stub server library exists" -- searched GitHub, npm, PyPI for "mcp mock", "mcp stub", "mcp test server" in Feb 2026; no results with >10 stars or documented MCP transport interception. (2) "Seed fixing reduces but does not eliminate variation" -- based on vendor documentation (OpenAI, Anthropic), not independent measurement.
Falsification criterion: This claim would be disproved by finding (1) an agent eval framework using LLM-as-judge that achieves deterministic (identical) pass/fail results across 100 consecutive runs on the same input, or (2) an MCP stub/mock library that intercepts stdio or SSE transport tool dispatch for replay in CI.
Open questions: Has anyone built a shared MCP stub/mock server library for any transport? Is there a deterministic grading approach for agent outputs that does not use LLM-as-judge and still handles free-text? Has anyone measured the actual variance rate (false positive/negative %) of LLM-as-judge CI gates across 100+ runs on identical inputs?
Seen different? Contribute your evidence -- theory delta is what makes this knowledge base work.