The True Cost of LLM Sprawl: A 12-Month Audit
In the eleven enterprise environments we have audited over the last twelve months, AI infrastructure spend has grown 5–8× faster than usage. The reason is not that LLMs got more expensive. The reason is that nobody is metering the work.
The Audit
From Q1 2025 through Q1 2026 we conducted cost audits for eleven enterprises with annualized LLM spend ranging from $180,000 to $1.4 million. Total in-scope spend: $4.2 million. The audit method was simple. We instrumented each enterprise's existing AI gateway (where one existed) or attached observability to its direct OpenAI / Anthropic API keys (where one did not) for ninety days, then decomposed every dollar by team, by application, by model class, and by outcome.
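To make the decomposition step concrete, here is a minimal sketch against a flat export of gateway call logs. The file name and column schema are illustrative, not any specific gateway's format:

```python
import pandas as pd

# Ninety days of gateway traffic, one row per metered call.
calls = pd.read_parquet("gateway_call_logs.parquet")

# Decompose every dollar four ways: team, application, model class, outcome.
breakdown = (
    calls.groupby(["team", "application", "model_class", "outcome"])["cost_usd"]
         .sum()
         .sort_values(ascending=False)
         .reset_index()
)

# Share of total spend per bucket, for the rank-order comparison.
breakdown["pct_of_total"] = 100 * breakdown["cost_usd"] / breakdown["cost_usd"].sum()
print(breakdown.head(20))
```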
Across all eleven environments, the same four buckets accounted for 100% of spend. The proportions varied by 6–9 percentage points around the means given in the headings below, but the rank order never changed.
Bucket 1: Idle & Abandoned Experimentation (31%)
This is the largest line item in nine of eleven environments and the second largest in the other two. Engineers spin up a notebook to evaluate a use case. They use a corporate API key. They get a result, ship the prototype, and move on to the next sprint. The notebook stays. Sometimes it stays running. Sometimes it makes a single call every five minutes for nine months, because nobody owns shutting it down.
One environment was paying $4,800/month for a single forgotten Jupyter kernel that had been making OpenAI calls since June 2024. Nobody knew it existed until the audit surfaced it. The kernel belonged to an engineer who had since left the company.
This is not a tooling failure. It is a metering failure. If the AI gateway is not associating every call with a workspace, an owner, and a hard expiry, this bucket will not shrink.
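As a sketch of what that association looks like at the gateway, with our own illustrative names rather than any particular product's API:

```python
# A minimal sketch of the metering rule: every call must resolve to a
# workspace with a live owner and a hard expiry, or the gateway refuses
# to forward it. Types and field names here are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Workspace:
    workspace_id: str
    owner: str | None          # becomes None when the owner offboards
    expires_at: datetime       # hard expiry set at creation, never "never"

def admit(call_workspace_id: str, registry: dict[str, Workspace]) -> None:
    ws = registry.get(call_workspace_id)
    if ws is None:
        raise PermissionError("call carries no known workspace; refuse to meter blind")
    if ws.owner is None:
        raise PermissionError(f"{ws.workspace_id} is orphaned; suspend, do not bill")
    if datetime.now(timezone.utc) >= ws.expires_at:
        raise PermissionError(f"{ws.workspace_id} expired; renewal requires an owner")
```

The forgotten Jupyter kernel above fails all three checks the moment its owner leaves or its expiry passes, instead of billing quietly for nine more months.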
Bucket 2: Over-provisioned Model Class (26%)
The pattern is consistent. A team builds a feature. They prompt-engineer it on GPT-4 because that is what was available on launch day. They ship it. The feature works. It scales. Twelve months later they are sending 6 million tokens a day to GPT-4, and 95% of those calls would produce identical output on a model that costs a fifth as much.
The fix is not "switch the model." The fix is automated routing: every call is evaluated against a quality gate, and if the cheaper model passes the gate for that workload, the call is silently re-routed without changing the application code. The control plane's job is to be the policy. The application's job is to call the gateway.
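A minimal sketch of that control flow, with hypothetical model names and the evaluator left abstract:

```python
from typing import Callable

def route(
    prompt: str,
    call_model: Callable[[str, str], str],    # (model_name, prompt) -> response
    passes_gate: Callable[[str, str], bool],  # (prompt, response) -> verdict
) -> str:
    # Cheap model first; the application never picks a model itself.
    draft = call_model("small-model", prompt)
    if passes_gate(prompt, draft):
        return draft                           # silently served by the cheaper class
    # Escalate only when the evaluator rejects the cheap response.
    return call_model("frontier-model", prompt)
```

The application passes a prompt and gets a response; which model answered is a policy decision it never sees.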
Bucket 3: Retry Storms & Misconfigured Timeouts (17%)
This bucket is the one that surprises CFOs. Most production AI traffic carries a retry policy, and most retry policies were written before anyone understood what a model timeout actually means. The result is that a transient 5xx from the upstream provider triggers three immediate retries, each of them metered, so one failed call becomes four billed calls. If the upstream provider is having a bad afternoon, the retry storm can quintuple the customer's monthly bill in a six-hour window.
Three controls reduce this bucket to under 2%; a combined sketch follows the list:
- Cost-aware retry budget. Per-workspace, per-hour cap on retry expenditure. When the budget is hit, the gateway returns a structured error to the caller instead of retrying.
- Idempotency keys. A retry that arrives within the dedup window is served from the cached prior response, not re-invoked upstream.
- Circuit breaker. When the upstream provider's error rate exceeds a threshold, the gateway stops retrying and starts failing fast. Operators get paged. The bill stops growing.
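Here is a combined sketch of all three controls under simplifying assumptions (in-memory state, a fixed three-attempt policy, illustrative thresholds); a production gateway would persist and shard this state per workspace:

```python
import time

RETRY_BUDGET_USD_PER_HOUR = 5.00      # per-workspace cap on retry spend (illustrative)
BREAKER_ERROR_RATE = 0.5              # trip when half the recent window failed

retry_spend: dict[str, float] = {}    # workspace -> retry dollars this hour
response_cache: dict[str, str] = {}   # idempotency key -> prior response
recent_outcomes: list[bool] = []      # rolling window: True means upstream error

def call_upstream(workspace: str, idem_key: str, invoke, est_cost: float) -> str:
    # Idempotency: a retry inside the dedup window is served from cache,
    # never re-invoked (and never re-metered) upstream.
    if idem_key in response_cache:
        return response_cache[idem_key]

    # Circuit breaker: fail fast while the provider is unhealthy.
    window = recent_outcomes[-100:]
    if window and sum(window) / len(window) >= BREAKER_ERROR_RATE:
        raise RuntimeError("circuit open: upstream unhealthy, failing fast")

    for attempt in range(3):
        if attempt > 0:
            # Retry budget: only the re-tries count against the hourly cap.
            spent = retry_spend.get(workspace, 0.0)
            if spent + est_cost > RETRY_BUDGET_USD_PER_HOUR:
                raise RuntimeError("retry budget exhausted: structured error to caller")
            retry_spend[workspace] = spent + est_cost
            time.sleep(2 ** attempt)  # backoff between metered attempts
        try:
            result = invoke()
            recent_outcomes.append(False)
            response_cache[idem_key] = result
            return result
        except Exception:
            recent_outcomes.append(True)
    raise RuntimeError("upstream failed after bounded retries")
```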
Bucket 4: Actual Delivered Value (26%)
This is what the budget should be funding. In every environment we audited, this bucket was a quarter or less of the total. After remediation, it became more than half — without reducing the application's actual capability or response quality.
The Eight Controls
Here is the playbook we recommend, in order of impact. Most environments achieve a 40–55% reduction in annualized spend within ninety days of full implementation, with no measurable degradation in user-facing quality. The numbers are conservative; the largest single result was 71%.
- Per-workspace metering with hard caps. Every API call carries a workspace identifier, every workspace has a monthly cap, and the cap is enforced at the gateway, not in a billing system that fires alerts after the budget is gone. The cap is the kill switch.
- Automated quality-gated model routing. Cheaper model first, expensive model only if the response fails an evaluator. Configurable per use case.
- Response caching with semantic deduplication. Identical or near-identical prompts in the same workspace within a configurable window do not pay twice.
- Cost-aware retry policy. Per-hour retry budget. When exhausted, requests fail fast with a structured error.
- Idle workspace detection. Any workspace with zero active users for thirty days is auto-suspended. Reactivation requires an owner and a stated purpose.
- Per-call tagging. Every call is labeled with the requesting application, the user, and the cost center. This is the input to FinOps.
- Token budget review at sprint planning. Engineering teams review their AI spend the same way they review cloud spend — weekly, with the team that owns the workload.
- Predictive cost ceiling. The gateway projects end-of-month spend based on rolling 7-day usage. When the projection exceeds the cap, the cap engages early at the workspace level.
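To make the last control concrete, a minimal sketch of the projection arithmetic; the cap value and the plumbing that supplies daily spend figures are assumed:

```python
import calendar
from datetime import date

def projected_month_end(spend_mtd: float, last_7_daily: list[float], today: date) -> float:
    # End-of-month projection: dollars already spent plus the rolling
    # 7-day daily run rate applied to the days still remaining.
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    run_rate = sum(last_7_daily) / len(last_7_daily)
    return spend_mtd + run_rate * (days_in_month - today.day)

def cap_should_engage(spend_mtd: float, last_7_daily: list[float],
                      monthly_cap: float, today: date) -> bool:
    # Engage the workspace cap early, before the budget is actually gone.
    return projected_month_end(spend_mtd, last_7_daily, today) > monthly_cap
</code-check>
```

With $9,200 spent by February 10, 2026 and a $450/day run rate, the projection is $9,200 + 18 × $450 = $17,300 against a $12,000 cap, so the gate engages well before month end.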
Why This Is Architectural, Not Operational
None of these controls work if they are bolted onto a SaaS gateway as features the customer may or may not enable. They work because the gateway is the single chokepoint through which all AI traffic flows, and the gateway enforces policy as code. The moment a team has its own API key in its own .env file talking directly to OpenAI, all eight controls are bypassed and the audit pattern starts over.
The architectural prerequisite is therefore: one gateway, no exceptions, every API key revoked, every call mediated. This is the same architectural prerequisite that lets you answer your CISO's third-party risk questions, your CFO's monthly review questions, and your auditor's "where is the data" questions. We have written about this elsewhere.
The 90-Day Result
The eleven environments that completed the full ninety-day program saw the spend curve flatten or invert. The "actual delivered value" bucket grew in absolute terms because the savings from the other three buckets were redeployed into more usage. The user experience of the deployed AI features did not change. The CFO's monthly variance shrank from ±28% to ±4%.
That is the cost engineering case for a sovereign AI gateway. The compliance case is in a separate piece. The HIPAA-specific implementation reference is in a third.
Considering a cost audit on your existing AI spend?
We will walk through your last 90 days of LLM traffic and produce a remediation plan with projected savings, before you commit to any product. No pitch deck, no obligation.
Request an audit →