Your LLM gateway will silently overspend your budget and bypass your guardrails
From Theory Delta | Methodology | Published 2026-03-01 | Updated 2026-03-25
You’re putting LiteLLM in front of your LLM calls because the docs promise cost control, fallback routing, and provider flexibility. That’s the standard pattern for multi-provider setups.
Here’s what will actually happen.
Your budget limits don’t work under concurrent load
LiteLLM’s budget and rate limit counters use read-modify-write without synchronization. Under concurrent requests, increments are silently lost — the counter never reflects actual usage.
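The failure mode is the classic lost update. A minimal sketch (plain Python, no LiteLLM code) of what happens when five concurrent requests all read the shared counter before any of them writes back:

```python
# Lost-update sketch: five "concurrent" requests against a shared counter.
# Each request reads the counter, adds its own usage, and writes back.
# If all reads happen before any write, every increment but the last is lost.

tokens_per_request = 100
counter = 0

# Step 1: all five requests read the counter before any has written.
observed = [counter for _ in range(5)]   # every request sees 0

# Step 2: each writes back (value it read) + (its own usage).
for seen in observed:
    counter = seen + tokens_per_request  # last writer wins

print(counter)  # 100 -- but 500 tokens of usage actually occurred
```

This is why low-concurrency staging passes: with requests arriving one at a time, each read sees the previous write and the counter stays accurate.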
How bad is it? Documented cases:
- $50 budget, $764.78 actual spend — a user configured a $50 budget via the AzureOpenAI client library. LiteLLM did not enforce it. Actual spend reached $764.78 before anyone noticed — a 15x overshoot with no alert, no exception, no error. (Issue #12977)
- 6.6x rate limit overshoot — 5 concurrent requests against a 100 TPM limit consumed 663 tokens. All 5 requests read `current_tpm=0` before any updated the counter. In a multi-tenant deployment, one customer can exhaust the provider quota for everyone. (Issue #18730)
- Team budgets bypassed entirely — when virtual keys belong to teams, user-level budget enforcement is skipped. Unlimited overspend with no mechanism to detect it. (Issue #12905)
- Pass-through routes ignore budgets completely — requests via pass-through endpoints are not budget-tracked at all. (Issue #10750)
What this means for you: If you’re relying on LiteLLM budget limits as your cost control, you don’t have cost control. Your staging environment (low concurrency) will pass. Production (concurrent requests) won’t. The budget says one number. Your provider bill says another. The gateway never tells you.
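Catching the drift means reconciling the two numbers yourself. A minimal sketch of the check — how you obtain the two totals (billing API export, your own token log) is up to your stack, and the function name and tolerance are illustrative:

```python
def reconcile_spend(gateway_reported: float, provider_billed: float,
                    tolerance: float = 0.05) -> bool:
    """Return True if gateway-reported spend is within `tolerance`
    (as a fraction of the bill) of the provider's actual charges.
    False means the gateway counter has drifted and can't be trusted."""
    if provider_billed == 0:
        return gateway_reported == 0
    drift = abs(provider_billed - gateway_reported) / provider_billed
    return drift <= tolerance

# Issue #12977's numbers: $50 believed spent vs $764.78 actually billed.
print(reconcile_spend(50.0, 764.78))   # False -- ~93% drift, alert
print(reconcile_spend(100.0, 102.0))   # True  -- within tolerance
```

Run it on a schedule; a single failing check is earlier warning than the monthly invoice.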
Your guardrails will silently stop working
LiteLLM’s content filtering and guardrail system has multiple confirmed bypass modes:
- Content filters intermittently fail — German passports sometimes masked, sometimes not. SSN detection unreliable. Nine distinct guardrail failures documented in a single issue, mixing complete non-function with intermittent behavior. (Issue #19637)
- GUI-configured guardrails never execute — guardrails defined via the UI appear in the interface but are never invoked on requests. (Issue #15584)
- Model-level guardrails don’t take effect — `pre_call_hook` runs before guardrails are attached to the request. (Issue #18363)
- Post-call guardrails skipped on passthrough — passthrough routes execute `pre_call` guardrails but skip `post_call`. Content that should be filtered on output passes through. (Issue #20270)
- Bedrock guardrail bypass — LiteLLM’s Bedrock integration returns the original unmodified input instead of the blocked/transformed output. The guardrail runs but its result is discarded. (Issue #22949)
What this means for you: If you’re using LiteLLM guardrails for content safety or compliance, content that should be blocked is reaching your users. Your guardrail dashboard shows “configured.” Your production traffic shows “not enforced.” There is no error.
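Since the dashboard can’t be trusted, verification has to happen on actual gateway output. A minimal sketch of the idea: after sending a known-bad probe through your real guardrail configuration, scan the response for the pattern the guardrail was supposed to mask. The SSN regex here is illustrative only, not a compliance-grade detector:

```python
import re

# Illustrative US SSN shape; a real check would reuse the same detector
# list your guardrail configuration claims to enforce.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def guardrail_leaked(response_text: str) -> bool:
    """True if content the guardrail should have masked survived."""
    return bool(SSN_PATTERN.search(response_text))

# What you want to see after a known-bad probe:
print(guardrail_leaked("Your SSN is [REDACTED]."))   # False -- masked
print(guardrail_leaked("Your SSN is 123-45-6789."))  # True  -- bypass
```

The point is the loop, not the regex: probe, check output, alert on a leak. Intermittent failures (like #19637) mean a single passing probe proves little; probe repeatedly.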
Your prompt caching will silently stop saving you money
LiteLLM’s prompt caching works for some model versions and silently breaks for others:
- 0% cache hits through proxy, 80% calling direct — Azure OpenAI’s gpt-5.2 prompt caching returns 0/5 cache hits through LiteLLM, vs 4/5 direct. The same proxy gets 5/5 with gpt-5.1. Model-specific parameter handling silently drops the cache key for certain models. No error. Your costs double. (Issue #18219)
- Cache hits always report zero — the `cache_hit` metric shows 0 even when OpenAI’s API returns `cached_tokens=1024`. The caching is working upstream but the gateway doesn’t report it, making it impossible to verify from gateway metrics alone. (Issue #6229)
- Load balancing defeats caching — if you use LiteLLM’s load balancer across multiple deployments of the same model, requests scatter across deployments, preventing cache warmup on any single one. The features work against each other. (Issue #6784)
What this means for you: If you enabled prompt caching expecting cost savings, verify cache hit rates by checking your provider’s billing directly — not the gateway’s metrics. If you also enabled load balancing, the two features may be cancelling each other out.
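The cross-check can be scripted. A minimal sketch, assuming OpenAI-style response bodies where cached usage lives under `usage.prompt_tokens_details.cached_tokens`; the gateway-side `cache_hit` count is whatever your gateway metrics expose, so treat that input as an assumption for your version:

```python
def cached_tokens_from_response(resp: dict) -> int:
    """Pull the cached token count from an OpenAI-style response body."""
    return (resp.get("usage", {})
                .get("prompt_tokens_details", {})
                .get("cached_tokens", 0))

def cache_metrics_diverge(provider_cached_tokens: int,
                          gateway_reported_hits: int) -> bool:
    """True when the provider billed cached tokens but the gateway
    reported no cache hits -- the Issue #6229 symptom."""
    return provider_cached_tokens > 0 and gateway_reported_hits == 0

# Issue #6229's shape: provider says cached_tokens=1024, gateway says 0.
resp = {"usage": {"prompt_tokens_details": {"cached_tokens": 1024}}}
print(cache_metrics_diverge(cached_tokens_from_response(resp), 0))  # True
```

When the check fires, trust the provider side: it is the one generating the bill.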
Your fallback routing may route to the failing provider
When a provider fails, the gateway should route to a healthy alternative. In practice:
- Cascading fallback failure — when both primary and fallback deployments are unhealthy, the router attempts a fallback-for-the-fallback (which doesn’t exist), producing confusing errors that affect unrelated models. (Issue #17729)
- Cooldown race condition — the cooldown mechanism that should mark unhealthy providers tracks failures with the same unsynchronized read-modify-write as the budget counters. Under concurrent load, providers that should be cooled down aren’t, because the failure counter never reaches the threshold. Requests continue routing to dead providers. (Issue #20977)
- Mid-stream fallback injects a default system prompt — when a streaming call fails mid-response and falls back to another provider, LiteLLM prepends “You are a helpful assistant…” to the continuation context. Your domain-specific persona is silently overwritten. No config option to disable this. (Issue #18229, no fix planned)
- Multimodal fallback drops images — the OpenAI handler mutates the shared messages object in place. When the request falls back to Gemini, the image data is missing because Gemini requires Base64 `inline_data` while OpenAI uses URL-based `image_url`. Zero image tokens in usage confirms the loss. (Issue #15803, closed not-planned)
What this means for you: Your multi-provider resilience layer — the reason you added a gateway — may be the thing that’s failing. If you’re debugging latency spikes or wrong outputs that look like LLM problems, check whether the gateway is routing to a provider it should have marked as unhealthy.
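The image-dropping failure above is an instance of a general Python hazard: a handler that transforms a shared `messages` list in place corrupts it for any later fallback path. A minimal sketch of the hazard and the fix — the function names and message shapes are simplified (the real handler converts image parts to wire format rather than dropping them), but the mutation problem is the same:

```python
import copy

def make_messages():
    return [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/x.png"}},
    ]}]

def to_wire_format(msgs):
    # BUG (sketch): rewrites the caller's structure in place.
    for m in msgs:
        m["content"] = [c for c in m["content"] if c["type"] != "image_url"]
    return msgs

def to_wire_format_safe(msgs):
    # FIX (sketch): transform a deep copy; the caller's messages stay intact.
    msgs = copy.deepcopy(msgs)
    for m in msgs:
        m["content"] = [c for c in m["content"] if c["type"] != "image_url"]
    return msgs

shared = make_messages()
to_wire_format(shared)                   # primary provider's transform...
leftover = len(shared[0]["content"])     # 0 -- fallback now sees no image

shared2 = make_messages()
to_wire_format_safe(shared2)             # copy-based transform
intact = len(shared2[0]["content"])      # 1 -- fallback still sees the image
```

`copy.deepcopy` per transform costs a little memory; losing the image silently costs a wrong answer.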
The pattern: gateway features fail silently
This is what makes gateways dangerous: they don’t crash when they fail. Budget counters drift. Guardrails pass. Cache misses go unreported. Fallbacks route to dead providers. Your monitoring says everything is fine because no exceptions are thrown.
We analysed the LiteLLM issues documented above plus additional confirmed failures in config/state management and parameter forwarding. The consistent pattern: wrong behavior with no exception raised. Detection requires monitoring the actual forwarded request payloads and provider billing, not just your gateway’s metrics and exception logs.
What to do
- Track spend independently. Query your provider’s billing API or maintain your own token counter. Compare against the gateway’s reported spend weekly. If they diverge, the gateway is wrong. The documented worst case is 15x overshoot (#12977).
- Verify guardrails end-to-end. Send known-bad content through your production guardrail configuration and verify it’s actually blocked. Do this after every LiteLLM upgrade — guardrail regressions are documented across multiple versions.
- Check cache hit rates at the provider, not the gateway. If your provider’s billing shows cached tokens but the gateway reports zero, the gateway’s cache metrics are broken (#6229). If load balancing is on, consider pinning cache-heavy workloads to a single deployment.
- Test fallback routing under failure conditions. Simulate a provider outage under concurrent load and verify the gateway actually routes away. The cooldown race condition means it may not (#20977).
- Keep gateway config minimal. Each feature you enable is a feature that can fail silently. Use the gateway for routing. Let purpose-built tools handle guardrails, budgets, and caching independently.
- If you’re considering alternatives: Bifrost (Go, ~2,600 stars) and TensorZero (Rust, ~11,000 stars) avoid the Python GIL constraint. Both are younger with smaller provider ecosystems. Bifrost has documented issues with parallel tool call streaming and missing `finish_reason` on vLLM-backed models. TensorZero explicitly documents provider capability gaps rather than silently dropping unsupported parameters. Switching gateways changes which features fail silently, not whether they do — verify any gateway’s claims against your own workload.
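If you do keep counters of your own (for independent spend tracking or failure detection), the underlying fix for the race theme in this finding is atomicity. A minimal sketch using an in-process `threading.Lock`; a real multi-worker proxy would need the equivalent atomicity on the shared store, such as a Redis `INCRBY`:

```python
import threading

class AtomicCounter:
    """Read-modify-write guarded by a lock: no increments are lost."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, amount: int) -> int:
        # The lock makes read + add + write one indivisible step.
        with self._lock:
            self._value += amount
            return self._value

    @property
    def value(self) -> int:
        with self._lock:
            return self._value

# Five concurrent "requests", 100 tokens each.
counter = AtomicCounter()
threads = [threading.Thread(target=counter.add, args=(100,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)  # 500 -- every request's usage is counted
```

Contrast with the lost-update pattern described in the budget section, where the same five requests could leave the counter at 100.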
Evidence
All claims in this finding trace to public GitHub issues. No external benchmarks or aggregate statistics are used unless explicitly attributed.
| Claim | Primary Source | Verified |
|---|---|---|
| Budget bypass: $50 configured, $764.78 actual | Issue #12977 | Yes — user report with spend data |
| TPM rate limit bypass: 6.6x overshoot | Issue #18730 | Yes — reproduction with 5 concurrent requests |
| Team budget bypass: unlimited overspend | Issue #12905 | Yes — confirmed by reporter |
| Pass-through routes ignore budgets | Issue #10750 | Yes — confirmed |
| Content filter intermittent failure (9 modes) | Issue #19637 | Yes — detailed reproduction |
| GUI guardrails never invoked | Issue #15584 | Yes — confirmed |
| Model-level guardrails don’t take effect | Issue #18363 | Yes — pre_call_hook timing confirmed |
| Post-call guardrails skipped on passthrough | Issue #20270 | Yes — confirmed |
| Bedrock guardrail output discarded | Issue #22949 | Yes — confirmed |
| Azure prompt caching 0% via proxy | Issue #18219 | Yes — 0/5 via proxy vs 4/5 direct |
| Cache hit metric always zero | Issue #6229 | Yes — confirmed |
| Load balancing defeats caching | Issue #6784 | Yes — architectural conflict |
| Cascading fallback failure | Issue #17729 | Yes — confirmed on v1.74.9 |
| Cooldown counter race condition | Issue #20977 | Yes — reproduction published |
| Mid-stream fallback injects system prompt | Issue #18229 | Yes — no fix planned |
| Multimodal fallback drops images | Issue #15803 | Yes — closed not-planned |
Confidence: Every claim above links to a public GitHub issue with a reproduction or confirmation. Where Theory Delta has performed its own analysis (e.g., the “silent failure” classification), this is stated explicitly rather than presented as an external finding.
What would disprove this: A LiteLLM release that ships atomic budget counters, verified end-to-end guardrail execution across all configuration methods, and accurate cache hit reporting. As of March 2026, the PRs addressing the budget race condition (#20979) and multiple guardrail fixes remain unmerged.
Last verified: 2026-03-22
Seen different? Contribute your evidence (confirming or contradicting) via the Theory Delta MCP contribute tool or at theorydelta.com/contribute.
Environments Tested
| Tool | Version | Result |
|---|---|---|
| LiteLLM | 1.55+ | 16 silent failure modes confirmed via public GitHub issues with reproductions |
| Bifrost | latest (Mar 2026) | Parallel tool call streaming and finish_reason issues documented |
| TensorZero | latest (Mar 2026) | Explicitly documents provider capability gaps |