Your LLM gateway will silently overspend your budget and bypass your guardrails
From Theory Delta | Methodology | Published 2026-03-01 | Updated 2026-03-25
You’re putting LiteLLM in front of your LLM calls because the docs promise cost control, fallback routing, and provider flexibility. That’s the standard pattern for multi-provider setups.
Here’s what will actually happen.
Your budget limits don’t work under concurrent load
LiteLLM’s budget and rate limit counters use read-modify-write without synchronization. Under concurrent requests, increments are silently lost — the counter never reflects actual usage.
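The failure mode is the classic lost update. A minimal sketch (plain Python, no LiteLLM code) of what happens when five concurrent requests all read the shared counter before any of them writes back:

```python
# Lost-update sketch: five "concurrent" requests against a shared counter.
# Each request reads the counter, adds its own usage, and writes back.
# If all reads happen before any write, every increment but the last is lost.

tokens_per_request = 100
counter = 0

# Step 1: all five requests read the counter before any has written.
observed = [counter for _ in range(5)]   # every request sees 0

# Step 2: each writes back (value it read) + (its own usage).
for seen in observed:
    counter = seen + tokens_per_request  # last writer wins

print(counter)  # 100 -- but 500 tokens of usage actually occurred
```

This is why low-concurrency staging passes: with requests arriving one at a time, each read sees the previous write and the counter stays accurate.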
How bad is it? Documented cases:
- $50 budget, $764.78 actual spend — a user configured a $50 budget via the AzureOpenAI client library. LiteLLM did not enforce it. Actual spend reached $764.78 before anyone noticed — a 15x overshoot with no alert, no exception, no error. (Issue #12977)
- 6.6x rate limit overshoot — 5 concurrent requests against a 100 TPM limit consumed 663 tokens. All 5 requests read `current_tpm=0` before any updated the counter. In a multi-tenant deployment, one customer can exhaust the provider quota for everyone. (Issue #18730)
- Team budgets bypassed entirely — when virtual keys belong to teams, user-level budget enforcement is skipped. Unlimited overspend with no mechanism to detect it. (Issue #12905)
- Pass-through routes ignore budgets completely — requests via pass-through endpoints are not budget-tracked at all. (Issue #10750)
What this means for you: If you’re relying on LiteLLM budget limits as your cost control, you don’t have cost control. Your staging environment (low concurrency) will pass. Production (concurrent requests) won’t. The budget says one number. Your provider bill says another. The gateway never tells you.
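Catching the drift means reconciling the two numbers yourself. A minimal sketch of the check — how you obtain the two totals (billing API export, your own token log) is up to your stack, and the function name and tolerance are illustrative:

```python
def reconcile_spend(gateway_reported: float, provider_billed: float,
                    tolerance: float = 0.05) -> bool:
    """Return True if gateway-reported spend is within `tolerance`
    (as a fraction of the bill) of the provider's actual charges.
    False means the gateway counter has drifted and can't be trusted."""
    if provider_billed == 0:
        return gateway_reported == 0
    drift = abs(provider_billed - gateway_reported) / provider_billed
    return drift <= tolerance

# Issue #12977's numbers: $50 believed spent vs $764.78 actually billed.
print(reconcile_spend(50.0, 764.78))   # False -- ~93% drift, alert
print(reconcile_spend(100.0, 102.0))   # True  -- within tolerance
```

Run it on a schedule; a single failing check is earlier warning than the monthly invoice.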
Your guardrails will silently stop working
LiteLLM’s content filtering and guardrail system has multiple confirmed bypass modes:
- Content filters intermittently fail — German passports sometimes masked, sometimes not. SSN detection unreliable. Nine distinct guardrail failures documented in a single issue, mixing complete non-function with intermittent behavior. (Issue #19637)
- GUI-configured guardrails never execute — guardrails defined via the UI appear in the interface but are never invoked on requests. (Issue #15584)
- Model-level guardrails don’t take effect — `pre_call_hook` runs before guardrails are attached to the request. (Issue #18363)
- Post-call guardrails skipped on passthrough — passthrough routes execute `pre_call` guardrails but skip `post_call`. Content that should be filtered on output passes through. (Issue #20270)
- Bedrock guardrail bypass — LiteLLM’s Bedrock integration returns the original unmodified input instead of the blocked/transformed output. The guardrail runs but its result is discarded. (Issue #22949)
What this means for you: If you’re using LiteLLM guardrails for content safety or compliance, content that should be blocked is reaching your users. Your guardrail dashboard shows “configured.” Your production traffic shows “not enforced.” There is no error.
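Since the dashboard can’t be trusted, verification has to happen on actual gateway output. A minimal sketch of the idea: after sending a known-bad probe through your real guardrail configuration, scan the response for the pattern the guardrail was supposed to mask. The SSN regex here is illustrative only, not a compliance-grade detector:

```python
import re

# Illustrative US SSN shape; a real check would reuse the same detector
# list your guardrail configuration claims to enforce.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def guardrail_leaked(response_text: str) -> bool:
    """True if content the guardrail should have masked survived."""
    return bool(SSN_PATTERN.search(response_text))

# What you want to see after a known-bad probe:
print(guardrail_leaked("Your SSN is [REDACTED]."))   # False -- masked
print(guardrail_leaked("Your SSN is 123-45-6789."))  # True  -- bypass
```

The point is the loop, not the regex: probe, check output, alert on a leak. Intermittent failures (like #19637) mean a single passing probe proves little; probe repeatedly.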
Your prompt caching will silently stop saving you money
LiteLLM’s prompt caching works for some model versions and silently breaks for others:
- 0% cache hits through proxy, 80% calling direct — Azure OpenAI’s gpt-5.2 prompt caching returns 0/5 cache hits through LiteLLM, vs 4/5 direct. The same proxy gets 5/5 with gpt-5.1. Model-specific parameter handling silently drops the cache key for certain models. No error. Your costs double. (Issue #18219)
- Cache hits always report zero — the `cache_hit` metric shows 0 even when OpenAI’s API returns `cached_tokens=1024`. The caching is working upstream but the gateway doesn’t report it, making it impossible to verify from gateway metrics alone. (Issue #6229)
- Load balancing defeats caching — if you use LiteLLM’s load balancer across multiple deployments of the same model, requests scatter across deployments, preventing cache warmup on any single one. The features work against each other. (Issue #6784)
What this means for you: If you enabled prompt caching expecting cost savings, verify cache hit rates by checking your provider’s billing directly — not the gateway’s metrics. If you also enabled load balancing, the two features may be cancelling each other out.
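The cross-check can be scripted. A minimal sketch, assuming OpenAI-style response bodies where cached usage lives under `usage.prompt_tokens_details.cached_tokens`; the gateway-side `cache_hit` count is whatever your gateway metrics expose, so treat that input as an assumption for your version:

```python
def cached_tokens_from_response(resp: dict) -> int:
    """Pull the cached token count from an OpenAI-style response body."""
    return (resp.get("usage", {})
                .get("prompt_tokens_details", {})
                .get("cached_tokens", 0))

def cache_metrics_diverge(provider_cached_tokens: int,
                          gateway_reported_hits: int) -> bool:
    """True when the provider billed cached tokens but the gateway
    reported no cache hits -- the Issue #6229 symptom."""
    return provider_cached_tokens > 0 and gateway_reported_hits == 0

# Issue #6229's shape: provider says cached_tokens=1024, gateway says 0.
resp = {"usage": {"prompt_tokens_details": {"cached_tokens": 1024}}}
print(cache_metrics_diverge(cached_tokens_from_response(resp), 0))  # True
```

When the check fires, trust the provider side: it is the one generating the bill.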
Your fallback routing may route to the failing provider
When a provider fails, the gateway should route to a healthy alternative. In practice:
- Cascading fallback failure — when both primary and fallback deployments are unhealthy, the router attempts a fallback-for-the-fallback (which doesn’t exist), producing confusing errors that affect unrelated models. (Issue #17729)
- Cooldown race condition — the cooldown mechanism that should mark unhealthy providers tracks failures with the same unsynchronized read-modify-write as the budget counters. Under concurrent load, providers that should be cooled down aren’t, because the failure counter never reaches the threshold. Requests continue routing to dead providers. (Issue #20977)
- Mid-stream fallback injects a default system prompt — when a streaming call fails mid-response and falls back to another provider, LiteLLM prepends “You are a helpful assistant…” to the continuation context. Your domain-specific persona is silently overwritten. No config option to disable this. (Issue #18229, no fix planned)
- Multimodal fallback drops images — the OpenAI handler mutates the shared messages object in place. When the request falls back to Gemini, the image data is missing because Gemini requires Base64 `inline_data` while OpenAI uses URL-based `image_url`. Zero image tokens in usage confirms the loss. (Issue #15803, closed not-planned)
What this means for you: Your multi-provider resilience layer — the reason you added a gateway — may be the thing that’s failing. If you’re debugging latency spikes or wrong outputs that look like LLM problems, check whether the gateway is routing to a provider it should have marked as unhealthy.
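The image-dropping failure above is an instance of a general Python hazard: a handler that transforms a shared `messages` list in place corrupts it for any later fallback path. A minimal sketch of the hazard and the fix — the function names and message shapes are simplified (the real handler converts image parts to wire format rather than dropping them), but the mutation problem is the same:

```python
import copy

def make_messages():
    return [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/x.png"}},
    ]}]

def to_wire_format(msgs):
    # BUG (sketch): rewrites the caller's structure in place.
    for m in msgs:
        m["content"] = [c for c in m["content"] if c["type"] != "image_url"]
    return msgs

def to_wire_format_safe(msgs):
    # FIX (sketch): transform a deep copy; the caller's messages stay intact.
    msgs = copy.deepcopy(msgs)
    for m in msgs:
        m["content"] = [c for c in m["content"] if c["type"] != "image_url"]
    return msgs

shared = make_messages()
to_wire_format(shared)                   # primary provider's transform...
leftover = len(shared[0]["content"])     # 0 -- fallback now sees no image

shared2 = make_messages()
to_wire_format_safe(shared2)             # copy-based transform
intact = len(shared2[0]["content"])      # 1 -- fallback still sees the image
```

`copy.deepcopy` per transform costs a little memory; losing the image silently costs a wrong answer.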
The pattern: gateway features fail silently
This is what makes gateways dangerous: they don’t crash when they fail. Budget counters drift. Guardrails pass. Cache misses go unreported. Fallbacks route to dead providers. Your monitoring says everything is fine because no exceptions are thrown.
We analysed the LiteLLM issues documented above plus additional confirmed failures in config/state management and parameter forwarding. The consistent pattern: wrong behavior with no exception raised. Detection requires monitoring the actual forwarded request payloads and provider billing, not just your gateway’s metrics and exception logs.
What to do
- Track spend independently. Query your provider’s billing API or maintain your own token counter. Compare against the gateway’s reported spend weekly. If they diverge, the gateway is wrong. The documented worst case is 15x overshoot (#12977).
- Verify guardrails end-to-end. Send known-bad content through your production guardrail configuration and verify it’s actually blocked. Do this after every LiteLLM upgrade — guardrail regressions are documented across multiple versions.
- Check cache hit rates at the provider, not the gateway. If your provider’s billing shows cached tokens but the gateway reports zero, the gateway’s cache metrics are broken (#6229). If load balancing is on, consider pinning cache-heavy workloads to a single deployment.
- Test fallback routing under failure conditions. Simulate a provider outage under concurrent load and verify the gateway actually routes away. The cooldown race condition means it may not (#20977).
- Keep gateway config minimal. Each feature you enable is a feature that can fail silently. Use the gateway for routing. Let purpose-built tools handle guardrails, budgets, and caching independently.
- If you’re considering alternatives: Bifrost (Go, ~2,600 stars) and TensorZero (Rust, ~11,000 stars) avoid the Python GIL constraint. Both are younger with smaller provider ecosystems. Bifrost has documented issues with parallel tool call streaming and missing `finish_reason` on vLLM-backed models. TensorZero explicitly documents provider capability gaps rather than silently dropping unsupported parameters. Switching gateways changes which features fail silently, not whether they do — verify any gateway’s claims against your own workload.
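If you do keep counters of your own (for independent spend tracking or failure detection), the underlying fix for the race theme in this finding is atomicity. A minimal sketch using an in-process `threading.Lock`; a real multi-worker proxy would need the equivalent atomicity on the shared store, such as a Redis `INCRBY`:

```python
import threading

class AtomicCounter:
    """Read-modify-write guarded by a lock: no increments are lost."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, amount: int) -> int:
        # The lock makes read + add + write one indivisible step.
        with self._lock:
            self._value += amount
            return self._value

    @property
    def value(self) -> int:
        with self._lock:
            return self._value

# Five concurrent "requests", 100 tokens each.
counter = AtomicCounter()
threads = [threading.Thread(target=counter.add, args=(100,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)  # 500 -- every request's usage is counted
```

Contrast with the lost-update pattern described in the budget section, where the same five requests could leave the counter at 100.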
Evidence
All claims in this finding trace to public GitHub issues. No external benchmarks or aggregate statistics are used unless explicitly attributed.
| Claim | Primary Source | Verified |
|---|---|---|
| Budget bypass: $50 configured, $764.78 actual | Issue #12977 | Yes — user report with spend data |
| TPM rate limit bypass: 6.6x overshoot | Issue #18730 | Yes — reproduction with 5 concurrent requests |
| Team budget bypass: unlimited overspend | Issue #12905 | Yes — confirmed by reporter |
| Pass-through routes ignore budgets | Issue #10750 | Yes — confirmed |
| Content filter intermittent failure (9 modes) | Issue #19637 | Yes — detailed reproduction |
| GUI guardrails never invoked | Issue #15584 | Yes — confirmed |
| Model-level guardrails don’t take effect | Issue #18363 | Yes — pre_call_hook timing confirmed |
| Post-call guardrails skipped on passthrough | Issue #20270 | Yes — confirmed |
| Bedrock guardrail output discarded | Issue #22949 | Yes — confirmed |
| Azure prompt caching 0% via proxy | Issue #18219 | Yes — 0/5 via proxy vs 4/5 direct |
| Cache hit metric always zero | Issue #6229 | Yes — confirmed |
| Load balancing defeats caching | Issue #6784 | Yes — architectural conflict |
| Cascading fallback failure | Issue #17729 | Yes — confirmed on v1.74.9 |
| Cooldown counter race condition | Issue #20977 | Yes — reproduction published |
| Mid-stream fallback injects system prompt | Issue #18229 | Yes — no fix planned |
| Multimodal fallback drops images | Issue #15803 | Yes — closed not-planned |
Confidence: Every claim above links to a public GitHub issue with a reproduction or confirmation. Where Theory Delta has performed its own analysis (e.g., the “silent failure” classification), this is stated explicitly rather than presented as an external finding.
What would disprove this: A LiteLLM release that ships atomic budget counters, verified end-to-end guardrail execution across all configuration methods, and accurate cache hit reporting. As of March 2026, the PRs addressing the budget race condition (#20979) and multiple guardrail fixes remain unmerged.
Last verified: 2026-03-22
Seen different? Contribute your evidence (confirming or contradicting) via the Theory Delta MCP contribute tool or at theorydelta.com/contribute.
Environments Tested
| Tool | Version | Result |
|---|---|---|
| LiteLLM | 1.55+ | 16 silent failure modes confirmed via public GitHub issues with reproductions |
| Bifrost | latest (Mar 2026) | Parallel tool call streaming and finish_reason issues documented |
| TensorZero | latest (Mar 2026) | Explicitly documents provider capability gaps |