Published: 2026-03-01 | Last verified: 2026-03-22 | Empirical finding: 16 claims, 3 tested

Your LLM gateway will silently overspend your budget and bypass your guardrails

From Theory Delta | Methodology | Published 2026-03-01 | Updated 2026-03-25

You’re putting LiteLLM in front of your LLM calls because the docs promise cost control, fallback routing, and provider flexibility. That’s the standard pattern for multi-provider setups.

Here’s what will actually happen.

Your budget limits don’t work under concurrent load

LiteLLM’s budget and rate limit counters use read-modify-write without synchronization. Under concurrent requests, increments are silently lost — the counter never reflects actual usage.
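The lost-update mechanism can be sketched in a few lines. This is an illustrative model of the hazard, not LiteLLM's actual code: eight workers each add 3 cents to a shared budget counter 200 times. The unsynchronized version reads, yields, then writes, so concurrent writes clobber each other; the locked version cannot lose increments.

```python
import threading
import time

def unsafe_worker(counter, amount, n):
    # Read-modify-write with no synchronization: the failure mode above.
    for _ in range(n):
        current = counter["spend_cents"]            # read
        time.sleep(0)                               # widen the race window
        counter["spend_cents"] = current + amount   # write: clobbers peers

def locked_worker(counter, amount, n, lock):
    # Same arithmetic, but the read-modify-write is atomic.
    for _ in range(n):
        with lock:
            counter["spend_cents"] += amount

def run(worker, extra=()):
    counter = {"spend_cents": 0}
    threads = [threading.Thread(target=worker, args=(counter, 3, 200) + extra)
               for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["spend_cents"]

expected = 8 * 200 * 3                              # 4800 cents if nothing is lost
unsafe_total = run(unsafe_worker)                   # typically far below expected
locked_total = run(locked_worker, (threading.Lock(),))  # exactly expected
```

In a multi-process proxy a thread lock is not enough; the equivalent fix is an atomic counter in shared storage (e.g. a Redis INCRBY), which is the shape of the unmerged fix referenced later in this finding.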

How bad is it? Documented cases:

  • $50 budget, $764.78 actual spend — a user configured a $50 budget via the AzureOpenAI client library. LiteLLM did not enforce it. Actual spend reached $764.78 before anyone noticed — a 15x overshoot with no alert, no exception, no error. (Issue #12977)

  • 6.6x rate limit overshoot — 5 concurrent requests against a 100 TPM limit consumed 663 tokens. All 5 requests read current_tpm=0 before any updated the counter. In a multi-tenant deployment, one customer can exhaust the provider quota for everyone. (Issue #18730)

  • Team budgets bypassed entirely — when virtual keys belong to teams, user-level budget enforcement is skipped. Unlimited overspend with no mechanism to detect it. (Issue #12905)

  • Pass-through routes ignore budgets completely — requests via pass-through endpoints are not budget-tracked at all. (Issue #10750)

What this means for you: If you’re relying on LiteLLM budget limits as your cost control, you don’t have cost control. Your staging environment (low concurrency) will pass. Production (concurrent requests) won’t. The budget says one number. Your provider bill says another. The gateway never tells you.

Your guardrails will silently stop working

LiteLLM’s content filtering and guardrail system has multiple confirmed bypass modes:

  • Content filters intermittently fail — German passports sometimes masked, sometimes not. SSN detection unreliable. Nine distinct guardrail failures documented in a single issue, mixing complete non-function with intermittent behavior. (Issue #19637)

  • GUI-configured guardrails never execute — guardrails defined via the UI appear in the interface but are never invoked on requests. (Issue #15584)

  • Model-level guardrails don’t take effect: pre_call_hook runs before guardrails are attached to the request. (Issue #18363)

  • Post-call guardrails skipped on passthrough — passthrough routes execute pre_call guardrails but skip post_call. Content that should be filtered on output passes through. (Issue #20270)

  • Bedrock guardrail bypass — LiteLLM’s Bedrock integration returns the original unmodified input instead of the blocked/transformed output. The guardrail runs but its result is discarded. (Issue #22949)

What this means for you: If you’re using LiteLLM guardrails for content safety or compliance, content that should be blocked is reaching your users. Your guardrail dashboard shows “configured.” Your production traffic shows “not enforced.” There is no error.

Your prompt caching will silently stop saving you money

LiteLLM’s prompt caching works for some model versions and silently breaks for others:

  • 0% cache hits through proxy, 80% calling direct — Azure OpenAI’s gpt-5.2 prompt caching returns 0/5 cache hits through LiteLLM, vs 4/5 direct. The same proxy gets 5/5 with gpt-5.1. Model-specific parameter handling silently drops the cache key for certain models. No error. Your costs double. (Issue #18219)

  • Cache hits always report zero: the cache_hit metric shows 0 even when OpenAI’s API returns cached_tokens=1024. The caching is working upstream but the gateway doesn’t report it, making it impossible to verify from gateway metrics alone. (Issue #6229)

  • Load balancing defeats caching — if you use LiteLLM’s load balancer across multiple deployments of the same model, requests scatter across deployments, preventing cache warmup on any single one. The features work against each other. (Issue #6784)

What this means for you: If you enabled prompt caching expecting cost savings, verify cache hit rates by checking your provider’s billing directly — not the gateway’s metrics. If you also enabled load balancing, the two features may be cancelling each other out.
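Verifying cache hits from the provider side can be done from the response payload itself. A minimal sketch, assuming the OpenAI-style usage block (usage.prompt_tokens_details.cached_tokens); adjust the field paths for other providers:

```python
# Read cached-token counts from the provider's own response payload
# instead of trusting gateway metrics.

def provider_cached_tokens(response: dict) -> int:
    """Return cached prompt tokens as reported by the provider itself."""
    usage = response.get("usage", {})
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens", 0)

def cache_discrepancy(response: dict, gateway_reported: int) -> int:
    """Positive result: the provider cached tokens the gateway failed to
    report (the Issue #6229 failure mode)."""
    return provider_cached_tokens(response) - gateway_reported

# Shaped like Issue #6229: provider says 1024 cached tokens, gateway says 0.
resp = {"usage": {"prompt_tokens": 1500,
                  "prompt_tokens_details": {"cached_tokens": 1024}}}
gap = cache_discrepancy(resp, gateway_reported=0)   # -> 1024
```

Running this check on a sample of production responses, rather than on gateway dashboards, is what distinguishes "caching broken" from "caching working but unreported."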

Your fallback routing may route to the failing provider

When a provider fails, the gateway should route to a healthy alternative. In practice:

  • Cascading fallback failure — when both primary and fallback deployments are unhealthy, the router attempts a fallback-for-the-fallback (which doesn’t exist), producing confusing errors that affect unrelated models. (Issue #17729)

  • The cooldown mechanism that should mark unhealthy providers has a race condition — the counter that tracks provider failures uses the same unsynchronized read-modify-write as the budget counters. Under concurrent load, providers that should be cooled down aren’t, because the failure counter never reaches the threshold. Requests continue routing to dead providers. (Issue #20977)

  • Mid-stream fallback injects a default system prompt — when a streaming call fails mid-response and falls back to another provider, LiteLLM prepends “You are a helpful assistant…” to the continuation context. Your domain-specific persona is silently overwritten. No config option to disable this. (Issue #18229, no fix planned)

  • Multimodal fallback drops images — the OpenAI handler mutates the shared messages object in place. When falling back to Gemini, the image data is missing because Gemini requires Base64 inline_data while OpenAI uses URL image_url. Zero image tokens in usage confirms the loss. (Issue #15803, closed not-planned)
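The shared-mutation hazard behind the last bullet is easy to model. The adapter names below are illustrative, not LiteLLM internals: a provider adapter that rewrites the messages list in place leaves the fallback provider with an already-stripped payload, while transforming a deep copy preserves the original for retry.

```python
import copy

def adapter_in_place(messages):
    # Destructive transform: drops image parts from the shared list.
    for msg in messages:
        if isinstance(msg.get("content"), list):
            msg["content"] = [p for p in msg["content"]
                              if p.get("type") != "image_base64"]
    return messages

def safe_adapter(messages):
    # Non-destructive: transform a deep copy, so a fallback provider
    # still sees the untouched original.
    return adapter_in_place(copy.deepcopy(messages))

original = [{"role": "user",
             "content": [{"type": "text", "text": "describe this"},
                         {"type": "image_base64", "data": "<bytes>"}]}]

safe_out = safe_adapter(original)       # original untouched
image_survives = any(p["type"] == "image_base64"
                     for p in original[0]["content"])

adapter_in_place(original)              # the buggy path: image is gone
image_after_mutation = any(p["type"] == "image_base64"
                           for p in original[0]["content"])
```

A zero image-token count in usage, as noted in Issue #15803, is the observable symptom of the second path.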

What this means for you: Your multi-provider resilience layer — the reason you added a gateway — may be the thing that’s failing. If you’re debugging latency spikes or wrong outputs that look like LLM problems, check whether the gateway is routing to a provider it should have marked as unhealthy.

The pattern: gateway features fail silently

This is what makes gateways dangerous: they don’t crash when they fail. Budget counters drift. Guardrails pass. Cache misses go unreported. Fallbacks route to dead providers. Your monitoring says everything is fine because no exceptions are thrown.

We analysed the LiteLLM issues documented above plus additional confirmed failures in config/state management and parameter forwarding. The consistent pattern: wrong behavior with no exception raised. Detection requires monitoring the actual forwarded request payloads and provider billing, not just your gateway’s metrics and exception logs.

What to do

  1. Track spend independently. Query your provider’s billing API or maintain your own token counter. Compare against the gateway’s reported spend weekly. If they diverge, the gateway is wrong. The documented worst case is 15x overshoot (#12977).
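A minimal sketch of step 1, under assumptions: the per-1K-token prices below are placeholders (substitute your provider's actual rates), and usage dicts follow the common prompt_tokens/completion_tokens shape.

```python
# Independent spend tally built from each response's usage block,
# compared against the gateway's reported spend.

PRICE_PER_1K = {"input": 0.0025, "output": 0.010}   # placeholder USD rates

class SpendTracker:
    def __init__(self):
        self.total_usd = 0.0

    def record(self, usage: dict):
        # Accumulate cost from the provider-reported token counts.
        self.total_usd += (
            usage.get("prompt_tokens", 0) / 1000 * PRICE_PER_1K["input"]
            + usage.get("completion_tokens", 0) / 1000 * PRICE_PER_1K["output"])

    def diverges_from(self, gateway_usd: float, tolerance: float = 0.05) -> bool:
        """True if gateway-reported spend differs by more than `tolerance`
        (as a fraction). A persistent gap means the gateway counter is wrong."""
        if self.total_usd == 0:
            return gateway_usd != 0
        return abs(gateway_usd - self.total_usd) / self.total_usd > tolerance

tracker = SpendTracker()
tracker.record({"prompt_tokens": 2000, "completion_tokens": 500})
# tally: 2000/1000 * 0.0025 + 500/1000 * 0.010 = 0.01 USD
```

Run the comparison on a schedule; the point is not any single request but detecting sustained drift of the kind in Issue #12977.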

  2. Verify guardrails end-to-end. Send known-bad content through your production guardrail configuration and verify it’s actually blocked. Do this after every LiteLLM upgrade — guardrail regressions are documented across multiple versions.
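A smoke-test skeleton for step 2. In production you would send each probe through the gateway (e.g. an HTTP POST to your LiteLLM endpoint) and feed the completions into audit; the probe values and simulated responses below are illustrative assumptions, not real test vectors.

```python
# Known-bad probes: sensitive values the guardrail must mask or block.
PROBES = {
    "ssn": "123-45-6789",       # US SSN-shaped test value
    "passport": "C01X00T47",    # German passport-shaped test value
}

def audit(responses: dict) -> list:
    """Return probe names whose sensitive value leaked into the output,
    i.e. the guardrail neither masked nor blocked it."""
    return [name for name, text in responses.items()
            if PROBES[name] in text]

# Simulated gateway outputs: the SSN was masked, the passport leaked
# (the intermittent-masking failure mode of Issue #19637).
demo = {
    "ssn": "Your SSN is [MASKED]",
    "passport": "German passport number C01X00T47 noted",
}
failed = audit(demo)   # -> ["passport"]
```

Wire this into CI so the audit runs against the real endpoint after every LiteLLM upgrade; an empty result is the pass condition.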

  3. Check cache hit rates at the provider, not the gateway. If your provider’s billing shows cached tokens but the gateway reports zero, the gateway’s cache metrics are broken (#6229). If load balancing is on, consider pinning cache-heavy workloads to a single deployment.

  4. Test fallback routing under failure conditions. Simulate a provider outage under concurrent load and verify the gateway actually routes away. The cooldown race condition means it may not (#20977).

  5. Keep gateway config minimal. Each feature you enable is a feature that can fail silently. Use the gateway for routing. Let purpose-built tools handle guardrails, budgets, and caching independently.

  6. If you’re considering alternatives: Bifrost (Go, ~2,600 stars) and TensorZero (Rust, ~11,000 stars) avoid the Python GIL constraint. Both are younger with smaller provider ecosystems. Bifrost has documented issues with parallel tool call streaming and missing finish_reason on vLLM-backed models. TensorZero explicitly documents provider capability gaps rather than silently dropping unsupported parameters. Switching gateways changes which features fail silently, not whether they do — verify any gateway’s claims against your own workload.

Evidence

All claims in this finding trace to public GitHub issues. No external benchmarks or aggregate statistics are used unless explicitly attributed.

Claim | Primary Source | Verified
Budget bypass: $50 configured, $764.78 actual | Issue #12977 | Yes — user report with spend data
TPM rate limit bypass: 6.6x overshoot | Issue #18730 | Yes — reproduction with 5 concurrent requests
Team budget bypass: unlimited overspend | Issue #12905 | Yes — confirmed by reporter
Pass-through routes ignore budgets | Issue #10750 | Yes — confirmed
Content filter intermittent failure (9 modes) | Issue #19637 | Yes — detailed reproduction
GUI guardrails never invoked | Issue #15584 | Yes — confirmed
Model-level guardrails don’t take effect | Issue #18363 | Yes — pre_call_hook timing confirmed
Post-call guardrails skipped on passthrough | Issue #20270 | Yes — confirmed
Bedrock guardrail output discarded | Issue #22949 | Yes — confirmed
Azure prompt caching 0% via proxy | Issue #18219 | Yes — 0/5 via proxy vs 4/5 direct
Cache hit metric always zero | Issue #6229 | Yes — confirmed
Load balancing defeats caching | Issue #6784 | Yes — architectural conflict
Cascading fallback failure | Issue #17729 | Yes — confirmed on v1.74.9
Cooldown counter race condition | Issue #20977 | Yes — reproduction published
Mid-stream fallback injects system prompt | Issue #18229 | Yes — no fix planned
Multimodal fallback drops images | Issue #15803 | Yes — closed not-planned

Confidence: Every claim above links to a public GitHub issue with a reproduction or confirmation. Where Theory Delta has performed its own analysis (e.g., the “silent failure” classification), this is stated explicitly rather than presented as an external finding.

What would disprove this: A LiteLLM release that ships atomic budget counters, verified end-to-end guardrail execution across all configuration methods, and accurate cache hit reporting. As of March 2026, the PRs addressing the budget race condition (#20979) and multiple guardrail fixes remain unmerged.

Last verified: 2026-03-22

Seen different? Contribute your evidence (confirming or contradicting) via the Theory Delta MCP contribute tool or at theorydelta.com/contribute.

Environments Tested

Tool | Version | Result
LiteLLM | 1.55+ | 16 silent failure modes confirmed via public GitHub issues with reproductions
Bifrost | latest (Mar 2026) | Parallel tool call streaming and finish_reason issues documented
TensorZero | latest (Mar 2026) | Explicitly documents provider capability gaps