Stockyard caches LLM responses at the proxy layer. Exact match and semantic similarity, stored in SQLite. No Redis, no external cache service, no configuration beyond flipping a switch.
LLM API calls are expensive and slow. During development, you send the same prompt dozens of times while iterating. In production, multiple users ask similar questions and each one costs you a full API call. Without caching, every request hits the provider, bills tokens, and waits for a response.
Most caching solutions require running Redis or Memcached alongside your proxy. That is another service to deploy, monitor, and pay for. Stockyard builds caching directly into the proxy binary.
Exact match returns a cached response when the prompt, model, and parameters are identical. This is what you want during development when you are running the same test prompt repeatedly.
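Conceptually, exact match amounts to hashing the full request into a cache key. A minimal sketch of such keying, assuming nothing about Stockyard's actual implementation:

```python
import hashlib
import json

def cache_key(model: str, prompt: str, params: dict) -> str:
    """Build a deterministic key: identical model + prompt + params
    always hash to the same value; any difference changes the key."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,  # canonical ordering so dicts hash stably
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("gpt-4o", "Summarize this report.", {"temperature": 0})
k2 = cache_key("gpt-4o", "Summarize this report.", {"temperature": 0})
k3 = cache_key("gpt-4o", "Summarize this report.", {"temperature": 1})
assert k1 == k2  # identical request -> cache hit
assert k1 != k3  # changed parameter -> cache miss
```

This is why changing any sampling parameter, not just the prompt, produces a fresh call under exact matching.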
Semantic caching returns a cached response when the prompt is similar enough to a previous one, even if the wording is different. This is what you want in production when different users ask the same question in different ways.
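Semantic matching is typically done by embedding prompts as vectors and comparing them with cosine similarity. A toy sketch of the idea; the hand-written vectors and the 0.95 threshold are placeholders, not Stockyard's embedding model or defaults:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings standing in for a real embedding model.
cache = {
    "how do I reset my password": ([0.9, 0.1, 0.2], "Click 'Forgot password'..."),
}

def semantic_lookup(query_vec, threshold=0.95):
    """Return the cached response whose stored embedding is closest
    to the query, if its similarity clears the threshold."""
    vec, response = max(cache.values(), key=lambda v: cosine(query_vec, v[0]))
    return response if cosine(query_vec, vec) >= threshold else None

# A paraphrase whose embedding lands near the stored one hits the cache.
assert semantic_lookup([0.88, 0.12, 0.21]) is not None
# An unrelated prompt with a distant embedding misses.
assert semantic_lookup([0.0, 1.0, 0.0]) is None
```

The threshold is the knob: raise it and only near-identical prompts hit; lower it and paraphrases match more often, at the risk of returning a cached answer to a genuinely different question.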
Caching is a middleware module. Enable it at runtime without restarting:
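A sketch of what that toggle might look like; the endpoint path, port, and JSON shape here are illustrative assumptions, not Stockyard's documented API:

```shell
# Hypothetical admin endpoint -- path, port, and payload shape are
# assumptions for illustration, not Stockyard's documented API.
curl -X PUT http://localhost:8080/admin/modules/cache \
  -H "Content-Type: application/json" \
  -d '{"enabled": true, "mode": "semantic"}'
```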
Or set it in your config file:
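The equivalent file-based setting might look like the following; the key names, nesting, and YAML format are assumptions for illustration:

```yaml
# Hypothetical config keys -- names and structure are assumptions.
middleware:
  cache:
    enabled: true
    mode: semantic              # or "exact"
    similarity_threshold: 0.92  # hypothetical knob for semantic matching
```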
Cached responses include an X-Stockyard-Cache: hit header so you can tell when a response came from cache versus the provider.
Cached responses live in the same SQLite file as everything else. No Redis instance, no separate cache tier, no cache invalidation service. They are stored alongside traces and costs in stockyard.db.
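Because everything lives in one SQLite file, the cache is plausibly just another table. A sketch with a hypothetical schema; the table and column names are assumptions, not Stockyard's real layout:

```python
import sqlite3
import time

# Hypothetical schema -- Stockyard's actual table layout may differ.
conn = sqlite3.connect(":memory:")  # stand-in for stockyard.db
conn.execute("""
    CREATE TABLE IF NOT EXISTS response_cache (
        key        TEXT PRIMARY KEY,  -- hash of model + prompt + params
        response   TEXT NOT NULL,
        created_at REAL NOT NULL
    )
""")

def put(key: str, response: str) -> None:
    """Store (or overwrite) a cached response under its key."""
    conn.execute(
        "INSERT OR REPLACE INTO response_cache VALUES (?, ?, ?)",
        (key, response, time.time()),
    )

def get(key: str):
    """Return the cached response for a key, or None on a miss."""
    row = conn.execute(
        "SELECT response FROM response_cache WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None

put("abc123", '{"choices": ["..."]}')
assert get("abc123") is not None  # hit
assert get("zzz999") is None      # miss
```

One consequence of this design: a single file copy captures traces, costs, and cache together, which is why backup is just cp.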
Backing up the cache is the same as backing up everything else: cp stockyard.db stockyard.db.bak
Development loops. You are iterating on a prompt and sending the same request 30 times while tweaking your system prompt. Without caching, that is 30 API calls. With exact-match caching, it is 1 API call and 29 instant responses.
Customer support bots. Users ask "how do I reset my password" in 15 different ways. Without caching, each one costs tokens. With semantic caching, the first one hits the provider and the rest return the cached response.
Internal tools. Your team runs the same summarization prompts on daily reports. The reports change but the prompt template does not. Exact-match caching eliminates duplicate calls when the same report is summarized twice.
Caching works best for deterministic or near-deterministic workloads. If every request is unique and requires a fresh response, caching will not help. It also does not cache streaming responses mid-stream. The full response is cached once streaming completes.
Caching is one module in Stockyard's 76-module middleware chain. Enable it in one API call, disable it when you do not need it.
Get started