Your LLM API bill just doubled. Here are your options.
You shipped a feature that uses an LLM API. Usage grew. The bill followed. Now you are looking at the invoice trying to figure out where the money went and how to make it stop.
This is not a pricing problem. It is a visibility and control problem. Most LLM providers give you an API key and a dashboard that shows aggregate spend a day or two behind. They do not give you per-request cost tracking, spending caps, caching, or model routing. You are expected to build all of that yourself, or just absorb the cost.
There are really only four levers that matter for controlling LLM costs, and most teams are not using any of them.
Lever 1: Know what you are spending per request
The first problem is visibility. Provider dashboards show aggregate spend, but they do not tell you which endpoint, which user, or which prompt template is expensive. Without per-request cost data, you are optimizing blind.
The fix is to log cost per request as it happens. That means tracking the model, token count, and estimated cost for every API call, and making that data queryable. If you run requests through a proxy that records this automatically, you get the data without instrumenting every call site in your application.
```shell
# Example: query cost data from Stockyard
curl http://localhost:4200/api/observe/costs/summary \
  -H "X-Admin-Key: $STOCKYARD_ADMIN_KEY"

{"period":"24h","total_usd":4.23,"by_model":{"gpt-4o":3.41,"gpt-4o-mini":0.82},"requests":847}
```
Now you can see that GPT-4o is 81% of your spend. That is actionable. Maybe most of those requests do not need GPT-4o.
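If you are not running a proxy, the same per-request estimate can be computed by hand from token counts. A minimal Python sketch; the prices in the table are illustrative assumptions, not current rates, so substitute your provider's actual pricing:

```python
# Per-request cost logging sketch. Prices are illustrative assumptions,
# not current rates -- look up the real figures for your models.
PRICE_PER_1M_TOKENS = {
    # model: (input_usd, output_usd) per 1M tokens -- hypothetical values
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def estimate_cost(model, input_tokens, output_tokens):
    """Estimate the USD cost of one API call from its token counts."""
    in_price, out_price = PRICE_PER_1M_TOKENS[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# One record per call; any queryable sink (SQLite, JSON lines) works.
record = {
    "model": "gpt-4o-mini",
    "input_tokens": 1200,
    "output_tokens": 300,
    "cost_usd": estimate_cost("gpt-4o-mini", 1200, 300),
}
```

Writing one such record per call, tagged with the endpoint or user that triggered it, is enough to answer "which part of my app is expensive" with a single query.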
Lever 2: Cache repeated work
During development, you send the same prompt dozens of times while iterating. In production, different users ask similar questions. Without caching, each one is a fresh API call at full price.
Exact-match caching catches identical requests. Semantic caching catches requests that are different in wording but similar in meaning. Both return a stored response instead of making a new API call.
The cost savings depend entirely on your workload. If your app handles a lot of repeated or similar queries, caching can eliminate a meaningful portion of API calls. If every request is unique, caching will not help much. The only way to know is to turn it on and measure.
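As a sketch of the exact-match variant: key the cache on a hash of the full request payload, so only byte-identical requests hit. Here `call_api` is a stand-in for whatever function makes your real API call; semantic caching would compare embeddings instead, which is considerably more machinery.

```python
import hashlib
import json

# Exact-match cache: identical (model, messages) pairs return the stored
# response instead of triggering a new API call.
_cache = {}

def cache_key(model, messages):
    """Stable hash of the full request payload."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(model, messages, call_api):
    key = cache_key(model, messages)
    if key in _cache:
        return _cache[key]          # cache hit: no API call, no cost
    response = call_api(model, messages)
    _cache[key] = response
    return response
```

A production version also needs a TTL and an eviction policy, which is exactly the kind of machinery a proxy-level cache gives you for free.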
```shell
# Enable caching with a 1-hour TTL
curl -X PUT http://localhost:4200/api/proxy/modules/cachelayer \
  -H "X-Admin-Key: $STOCKYARD_ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"enabled": true, "config": {"ttl": 3600}}'
```
Cached responses include an `X-Stockyard-Cache: hit` header so you can measure the actual hit rate for your workload.
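Tallying that header across responses gives the hit rate directly. A trivial sketch:

```python
def hit_rate(header_values):
    """Fraction of responses whose cache header was 'hit'."""
    if not header_values:
        return 0.0
    hits = sum(1 for v in header_values if v == "hit")
    return hits / len(header_values)
```

If the rate comes back near zero after a representative traffic sample, your workload is mostly unique requests and caching is not your lever.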
Lever 3: Use the cheapest model that works
Not every request needs your most expensive model. A classification task, a simple extraction, or a formatting operation probably works fine on a smaller, cheaper model. But switching models usually means changing code, redeploying, and hoping nothing breaks.
Model aliasing lets you decouple model selection from application code. Your app requests a logical name like "fast" or "smart." The proxy maps it to a real model. When you want to swap gpt-4o for gpt-4o-mini on your classification endpoint, you change one alias instead of redeploying.
```shell
# Map "fast" to the cheapest model
curl -X PUT http://localhost:4200/api/proxy/aliases \
  -H "X-Admin-Key: $STOCKYARD_ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"alias": "fast", "model": "gpt-4o-mini"}'
```
This also lets you test new models safely. Map "fast" to the new model for an hour, check the traces, and switch back if quality drops.
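The mechanism behind an alias is just a lookup before the request is forwarded. A proxy-side sketch (the specific mapping here is an assumption, chosen to match the example above):

```python
# Alias table: the app sends a logical name, the proxy substitutes the
# real model ID before forwarding the request upstream.
ALIASES = {
    "fast": "gpt-4o-mini",
    "smart": "gpt-4o",
}

def resolve_model(requested):
    # Unknown names pass through unchanged, so requests that use real
    # model IDs directly keep working alongside aliased ones.
    return ALIASES.get(requested, requested)
```

The point of keeping this table at the proxy rather than in application config is that changing it requires no deploy: one API call updates the mapping for every caller at once.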
Lever 4: Set a hard spending limit
Most LLM APIs have no built-in spending cap. If your app has a bug that loops, or a user finds a way to trigger expensive prompts repeatedly, your bill absorbs it until you notice. That can take days.
A spending cap at the proxy level catches this before it hurts. When the daily limit is reached, the proxy returns a 429 instead of forwarding the request. Your app gets a clear signal to back off, and your budget stays intact.
```shell
# Set a $10/day spending cap
curl -X PUT http://localhost:4200/api/proxy/modules/costcap \
  -H "X-Admin-Key: $STOCKYARD_ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"enabled": true, "config": {"daily_limit_usd": 10.0}}'
```
This is the safety net that should have existed from day one. It is surprising how many teams run production LLM workloads with no spending limit at all.
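On the application side, the 429 is only useful if you handle it. A minimal backoff sketch, where `call_api` is a placeholder for your request function returning `(status_code, body)`:

```python
import time

def call_with_backoff(call_api, max_retries=3, base_delay=1.0):
    """Retry on 429 with exponential backoff instead of hammering the API."""
    for attempt in range(max_retries):
        status, body = call_api()
        if status != 429:
            return status, body
        time.sleep(base_delay * 2 ** attempt)   # 1s, 2s, 4s, ...
    return 429, None   # still capped; surface the failure to the caller
```

For a hard daily cap, retrying is mostly about surviving the boundary case where the limit resets soon; the real value is that the failure is explicit rather than a silent runaway bill.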
Why a proxy is the right place for this
All four of these levers work best when they sit between your application and the LLM provider. A proxy sees every request, can cache and rate-limit transparently, and records cost data without requiring changes to your application code.
You can build each of these yourself. Write a caching layer, add cost logging, build a model routing system, implement rate limiting. Or you can put a proxy in front of your provider and get all four in one place.
I built Stockyard because I had this exact problem. The examples in this post are real Stockyard API calls, but the concepts apply to any proxy setup. The important thing is to have visibility and controls between your app and the API, not to use any specific tool.
If you want to try it: `curl -fsSL stockyard.dev/install.sh | sh` gets you a running proxy in under a minute. The cost tracking, caching, aliasing, and spending caps are all available on the free tier.