Caching
Cache identical requests to save money and reduce latency.
Enable caching
Caching is disabled by default. Enable it via the API:
curl -X PUT http://localhost:4200/api/proxy/modules/cache \
  -d '{"enabled": true}'
Or in your config file:
cache:
  enabled: true
  strategy: exact  # exact or semantic
  ttl: 1h
  max_entries: 10000
How it works
When a request comes in, Stockyard hashes the model name and message array. If a matching response exists in cache and the TTL has not expired, the cached response is returned immediately. No API call is made to the provider.
By default, the cache key also includes the user ID, which prevents one user's cached responses from being served to another. Anonymous requests share a single cache.
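The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not Stockyard's actual implementation: the SHA-256 hash over canonical JSON, the "anonymous" bucket name, and the `cache_key` and `TTLCache` names are all assumptions made for the example.

```python
import hashlib
import json
import time


def cache_key(model, messages, user_id=None):
    """Derive a deterministic key from model, messages, and user ID.

    Hypothetical scheme: the real hash function and key layout may differ.
    """
    payload = {
        "model": model,
        "messages": messages,
        # Assumption: anonymous requests fall into one shared bucket.
        "user": user_id if user_id is not None else "anonymous",
    }
    # sort_keys makes the serialization independent of dict key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (stored_at, response)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at >= self.ttl_seconds:
            del self._entries[key]  # expired: treat as a miss
            return None
        return response

    def put(self, key, response):
        self._entries[key] = (self.clock(), response)
```

In this sketch, two requests with the same model, messages, and user ID always produce the same key, while the same request from a different user produces a different one.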
Cache strategies
exact matches only when the model and the messages are identical. This is the default and works well for tab completions, classification tasks, and any workload with repeated inputs.
semantic uses embedding similarity to match "close enough" requests. This catches requests that mean the same thing but differ in whitespace, punctuation, or minor wording. It gives a higher hit rate at the cost of slightly more overhead per request.
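To make the idea of a "close enough" match concrete, here is a rough sketch of similarity-based lookup. It assumes cosine similarity and a 0.95 threshold. Both are illustrative choices, and the `semantic_lookup` function is hypothetical rather than a description of how Stockyard compares embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(query_embedding, entries, threshold=0.95):
    """Return the most similar cached response, or None on a miss.

    entries is a list of (embedding, response) pairs. threshold is an
    assumed tuning value: higher means stricter matching.
    """
    best_score, best_response = threshold, None
    for embedding, response in entries:
        score = cosine_similarity(query_embedding, embedding)
        if score >= best_score:
            best_score, best_response = score, response
    return best_response
```

The extra overhead comes from computing an embedding for each incoming request and comparing it against stored entries, which an exact hash lookup avoids.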
Response headers
Cached responses include X-Stockyard-Cache: HIT in the response headers. Cache misses include X-Stockyard-Cache: MISS. Your application can use these to track cache effectiveness.
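One way an application might track cache effectiveness from these headers is sketched below. The `CacheStats` class is a hypothetical helper, not part of Stockyard. It only assumes that each response exposes its headers as a name-to-value mapping.

```python
class CacheStats:
    """Counts HIT and MISS values seen in X-Stockyard-Cache headers."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, headers):
        # HTTP header names are case-insensitive, so normalize them first.
        normalized = {name.lower(): value for name, value in headers.items()}
        status = normalized.get("x-stockyard-cache", "").strip().upper()
        if status == "HIT":
            self.hits += 1
        elif status == "MISS":
            self.misses += 1
        # Responses without the header are ignored.

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Call `record(response.headers)` after each request and read `hit_rate` periodically to see what fraction of requests was served from cache.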
Cache bypass
To bypass the cache for a specific request, add the header X-Stockyard-No-Cache: true. Stockyard skips the cache lookup for that request, but the fresh response is still stored for future requests.
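As a sketch, the bypass header can be attached to a request like this using Python's standard library. The chat completions path and the `build_request` helper are assumptions for illustration; only the host, the port, and the header name come from this page.

```python
import json
import urllib.request

# Assumption: the proxy listens on localhost:4200 (as in the curl examples)
# and exposes a chat completions route at this path.
CHAT_URL = "http://localhost:4200/v1/chat/completions"


def build_request(payload, bypass_cache=False, url=CHAT_URL):
    """Build a POST request, optionally asking Stockyard to skip the cache."""
    headers = {"Content-Type": "application/json"}
    if bypass_cache:
        headers["X-Stockyard-No-Cache"] = "true"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the result to `urllib.request.urlopen` sends the request. Making the bypass an explicit per-call flag keeps it from being applied to every request by accident.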
Embedding cache
Embedding requests (/v1/embeddings) have their own cache keyed by content hash. Enable it separately:
embedcache:
  enabled: true
  max_entries: 50000
  ttl: 24h
Monitoring cache performance
Check cache hit rates via the API:
curl http://localhost:4200/api/proxy/modules/cache
# Returns hit/miss counts, hit rate %, and memory usage