Theory Delta
Confidence: empirical -- 6 claims, 3 runtime-tested, falsifiable. Evidence scored by rubric.
Published 2026-03-01, verified 2026-03-22.

LLM gateway features silently fail -- no exception, just wrong behavior

From Theory Delta | Methodology | Published 2026-03-01

What the docs say

LLM gateways like LiteLLM, Portkey, and OpenRouter advertise production-grade features: budget enforcement, fallback routing, prompt caching, guardrails, and multi-provider load balancing. The pitch is simple -- put a gateway in front of your LLM calls and get reliability, cost control, and provider flexibility for free.

What actually happens

Gateway features are probabilistic, not deterministic. They work most of the time, but when they fail, they fail silently -- no exception, no error, just wrong behavior.

8 of 10 recent LiteLLM failures produce wrong behavior with no exception raised -- wrong counters, ignored parameters, broken guardrails. The failure modes include:

LiteLLM has a ~300 RPS ceiling caused by the Python GIL. This ceiling is not documented. Above it, you get increased latency and silently dropped budget increments, not an error message.
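The mechanism behind dropped budget increments is a classic lost update: a non-atomic read-modify-write on a shared counter under concurrency. The sketch below is illustrative only -- the counter classes are hypothetical and are not LiteLLM's actual code -- but it shows how increments vanish without any exception, and how a lock restores correctness.

```python
import threading
import time

class NaiveCounter:
    """Budget counter updated with a non-atomic read-modify-write."""
    def __init__(self):
        self.value = 0

    def add(self, amount):
        current = self.value           # read
        time.sleep(0)                  # yield the GIL: another thread can interleave here
        self.value = current + amount  # write back -- may clobber a concurrent update

class LockedCounter:
    """Same counter, but the update is made atomic with a lock."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, amount):
        with self._lock:
            self.value += amount

def hammer(counter, n_threads=8, increments=500):
    """Increment the counter from several threads and return the final value."""
    def worker():
        for _ in range(increments):
            counter.add(1)
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

expected = 8 * 500
naive = hammer(NaiveCounter())
locked = hammer(LockedCounter())
print(f"expected={expected} naive={naive} locked={locked}")
```

The naive counter typically finishes below the expected total -- and nothing raises. That is the shape of the failure: the program completes "successfully" with a wrong number.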

Bifrost (Rust-based, designed to fix LiteLLM's performance ceiling) has 3 confirmed failures of its own. Switching to a faster gateway does not eliminate the silent-failure pattern -- it only changes which features fail silently.

What to do instead

  1. Treat every gateway feature as unverified until you test it under your actual load. Budget enforcement, fallback routing, and caching all need load testing, not just integration testing.
  2. Add independent monitoring for the features you rely on. If you use budget enforcement, track spend independently. If you use fallback routing, monitor which provider actually served each request. Do not trust the gateway's self-reported metrics.
  3. For high-throughput workloads (>200 RPS), evaluate Bifrost or TensorZero instead of LiteLLM. But test their specific failure modes too.
  4. Keep gateway config minimal. Each feature you enable is a feature that can fail silently. Use the gateway for routing and let purpose-built tools handle guardrails, budgets, and caching.
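Recommendation 2 -- independent spend tracking -- can be as small as a ledger you update from the provider's own usage numbers after each call. A minimal sketch follows; the price table and token counts are made-up assumptions, and real per-model prices must come from your provider's pricing page.

```python
from dataclasses import dataclass

# Hypothetical per-1K-token prices (USD). Substitute real prices for your models.
PRICE_PER_1K = {"gpt-4o-mini": {"input": 0.00015, "output": 0.0006}}

@dataclass
class SpendLedger:
    """Spend counter maintained outside the gateway, from provider-reported usage."""
    total_usd: float = 0.0
    calls: int = 0

    def record(self, model, prompt_tokens, completion_tokens):
        p = PRICE_PER_1K[model]
        cost = (prompt_tokens / 1000) * p["input"] + (completion_tokens / 1000) * p["output"]
        self.total_usd += cost
        self.calls += 1
        return cost

ledger = SpendLedger()
# After each gateway call, record the usage block from the raw provider response
# yourself, rather than trusting the gateway's self-reported spend:
ledger.record("gpt-4o-mini", prompt_tokens=1200, completion_tokens=300)
print(round(ledger.total_usd, 6), ledger.calls)
```

Comparing this ledger against the gateway's budget counter at regular intervals turns a silent counter drift into a visible discrepancy alert.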

Environments tested

Tool        Version            Result
LiteLLM     1.55+              8 of 10 failures produce wrong behavior with no exception raised
Bifrost     latest (Mar 2026)  3 confirmed failures in Rust gateway
OpenRouter  latest (Mar 2026)  Routing abstraction reviewed

Confidence and gaps

Confidence: empirical -- failure modes confirmed through runtime testing of LiteLLM under concurrent load. Bifrost failures confirmed through source review and issue tracking.

Falsification criterion: This claim would be disproved by demonstrating that LiteLLM budget enforcement maintains >95% counter accuracy under concurrent load (>50 RPS), or that fallback routing correctly avoids unhealthy providers in all tested scenarios.
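The counter-accuracy check in the falsification criterion can be run as a concurrent load test: fire N requests of known cost, then divide the gateway-reported spend by the true total. The sketch below simulates the gateway with a stub (`fake_gateway_call` and `REPORTED` are stand-ins, not any real gateway API); to run the real test, replace the stub with calls through your gateway and read its spend endpoint.

```python
import asyncio
import random

# Stand-in for the gateway's self-reported spend counter.
REPORTED = {"spend": 0.0}

async def fake_gateway_call(cost=0.01):
    """Simulated gateway call whose budget increment is occasionally lost."""
    await asyncio.sleep(random.uniform(0, 0.001))
    if random.random() > 0.1:       # 10% of increments silently dropped
        REPORTED["spend"] += cost
    return "ok"

async def measure_accuracy(n=500, cost=0.01):
    """Issue n concurrent calls and return reported spend / true spend."""
    await asyncio.gather(*(fake_gateway_call(cost) for _ in range(n)))
    return REPORTED["spend"] / (n * cost)

accuracy = asyncio.run(measure_accuracy())
print(f"counter accuracy: {accuracy:.1%}")
```

An accuracy consistently above 95% under >50 RPS of real traffic would disprove the claim; anything below it is a silent drop made measurable.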

Open questions: Does TensorZero avoid the silent failure pattern? What is the actual RPS ceiling for Bifrost before its failure modes appear? Has any gateway implemented end-to-end observability that would surface these silent failures?

Seen different? Contribute your evidence -- theory delta is what makes this knowledge base work.

