From Theory Delta | Methodology | Published 2026-03-01
LLM gateways like LiteLLM, Portkey, and OpenRouter advertise production-grade features: budget enforcement, fallback routing, prompt caching, guardrails, and multi-provider load balancing. The pitch is simple -- put a gateway in front of your LLM calls and get reliability, cost control, and provider flexibility for free.
Gateway features are probabilistic, not deterministic. They work most of the time, but when they fail, they fail silently -- no exception, no error, just wrong behavior.
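Eight of 10 recent LiteLLM failures produce wrong output with no exception raised. The failure modes include dropped budget increments under concurrent load and fallback routing that does not avoid unhealthy providers.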
LiteLLM has an undocumented ceiling of roughly 300 RPS, attributed to the Python GIL. Above this threshold, the symptoms are increased latency and dropped budget increments, not an error message.
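One way a counter can drop increments without raising anything is an unguarded read-modify-write on shared state. The sketch below is a generic illustration, not LiteLLM's implementation: `NaiveBudget`, `LockedBudget`, and `simulate` are hypothetical names, and `asyncio.sleep(0)` is an assumed stand-in for whatever I/O sits between reading and writing the counter.

```python
import asyncio


class NaiveBudget:
    """Hypothetical spend tracker with an unguarded read-modify-write."""

    def __init__(self):
        self.spend = 0.0

    async def record(self, cost):
        current = self.spend          # read the counter
        await asyncio.sleep(0)        # yield, as real I/O would
        self.spend = current + cost   # write back a possibly stale value


class LockedBudget(NaiveBudget):
    """Same tracker, with the read-modify-write serialized by a lock."""

    def __init__(self):
        super().__init__()
        self._lock = asyncio.Lock()

    async def record(self, cost):
        async with self._lock:
            await super().record(cost)


async def _run(budget_cls, n_requests, cost):
    # Build the tracker inside the running loop so the lock belongs to it.
    budget = budget_cls()
    await asyncio.gather(*(budget.record(cost) for _ in range(n_requests)))
    return budget.spend


def simulate(n_requests=100, cost=1.0):
    """Return (naive_total, locked_total) for the same simulated traffic."""
    naive = asyncio.run(_run(NaiveBudget, n_requests, cost))
    locked = asyncio.run(_run(LockedBudget, n_requests, cost))
    return naive, locked
```

Under these assumptions, `simulate()` returns a naive total well below the locked total. Neither run raises an exception, which is the shape of failure described above: the only signal is a counter that disagrees with the traffic actually sent.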
Bifrost (Go-based, designed to fix LiteLLM's performance ceiling) has 3 confirmed failures of its own. Switching to a faster gateway does not eliminate the silent failure pattern -- it changes which features fail silently.
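Because the pattern is not tied to one gateway, a caller-side check can turn some silent failures into loud ones. The following is a minimal sketch, assuming responses arrive as OpenAI-style dictionaries with `model` and `usage.total_tokens` fields. `SilentFailure` and `check_response` are hypothetical names, not part of any gateway's API.

```python
class SilentFailure(Exception):
    """Raised when a response looks successful but violates an expectation."""


def check_response(response, allowed_models):
    """Return the response unchanged, or raise if it breaks an invariant.

    Assumes an OpenAI-style dict; adjust the field names for other shapes.
    """
    model = response.get("model")
    if model not in allowed_models:
        # E.g. a fallback route answered with a model the caller never approved.
        raise SilentFailure(f"served by unexpected model: {model!r}")

    usage = response.get("usage") or {}
    total = usage.get("total_tokens")
    if not isinstance(total, int) or total <= 0:
        # Without usage data, downstream spend tracking cannot be correct.
        raise SilentFailure("missing or non-positive usage.total_tokens")

    return response
```

A check like this only covers what the caller can observe in the response. It would not detect a budget counter that drifts inside the gateway, which needs an independent tally of spend.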
| Tool | Version | Result |
|---|---|---|
| LiteLLM | 1.55+ | 8 of 10 failures produce wrong behavior with no exception raised |
| Bifrost | latest (Mar 2026) | 3 confirmed failures in Go gateway |
| OpenRouter | latest (Mar 2026) | Routing abstraction reviewed |
Confidence: empirical -- failure modes confirmed through runtime testing of LiteLLM under concurrent load. Bifrost failures confirmed through source review and issue tracking.
Falsification criterion: This claim would be disproved by demonstrating that LiteLLM budget enforcement maintains >95% counter accuracy under concurrent load (>50 RPS), or that fallback routing correctly avoids unhealthy providers in all tested scenarios.
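The budget half of this criterion can be expressed as a small check. The source does not define "counter accuracy", so the sketch below assumes it means one minus the relative error between the spend the gateway recorded and the spend independently computed from the requests sent. `LoadTestResult`, `counter_accuracy`, and `disproves_budget_claim` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LoadTestResult:
    rps: float              # sustained request rate during the test
    expected_spend: float   # spend computed independently from requests sent
    recorded_spend: float   # spend reported by the gateway's budget counter


def counter_accuracy(result):
    """Assumed definition: 1 - relative error, floored at 0."""
    if result.expected_spend <= 0:
        raise ValueError("expected_spend must be positive")
    error = abs(result.recorded_spend - result.expected_spend)
    return max(0.0, 1.0 - error / result.expected_spend)


def disproves_budget_claim(result, min_rps=50.0, min_accuracy=0.95):
    """True if the run meets the criterion: >95% accuracy at >50 RPS."""
    return result.rps > min_rps and counter_accuracy(result) > min_accuracy
```

For example, a run at 120 RPS that records 9.8 against an expected 10.0 would count as disproving evidence under this definition, while a run at 20 RPS would not, however accurate the counter.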
Open questions: Does TensorZero avoid the silent failure pattern? What is the actual RPS ceiling for Bifrost before its failure modes appear? Has any gateway implemented end-to-end observability that would surface these silent failures?