Caching

Cache identical requests to save money and reduce latency.

Enable caching

Caching is disabled by default. Enable it via the API:

curl -X PUT http://localhost:4200/api/proxy/modules/cache \
  -H "Content-Type: application/json" \
  -d '{"enabled": true}'

Or in your config file:

cache:
  enabled: true
  strategy: exact      # exact or semantic
  ttl: 1h
  max_entries: 10000

How it works

When a request arrives, Stockyard hashes the model name and message array to form a cache key. If a response is stored under that key and its TTL has not expired, Stockyard returns the cached response immediately and makes no API call to the provider.

By default, cache keys also include the user ID, so one user's cached responses are never served to another user. Anonymous requests all share a single cache.
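
The lookup described above can be sketched in a few lines of Python. This is an illustration only, not Stockyard's implementation: the source does not specify the hash function or key layout, so SHA-256 over canonical JSON is an assumption, and cache_key and TTLCache are hypothetical names.

```python
import hashlib
import json
import time


def cache_key(model, messages, user_id=None):
    """Build a deterministic key from user ID, model, and messages.

    Assumption: SHA-256 over canonical JSON. The real hashing scheme
    is not documented here.
    """
    payload = json.dumps(
        {"user": user_id, "model": model, "messages": messages},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """Minimal in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, now=time.monotonic):
        self.ttl = ttl_seconds
        self.now = now      # injectable clock, useful for testing
        self.store = {}     # key -> (expires_at, response)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.now() >= expires_at:
            # Entry is past its TTL: drop it and report a miss.
            del self.store[key]
            return None
        return response

    def put(self, key, response):
        self.store[key] = (self.now() + self.ttl, response)
```

Because the user ID is part of the hashed payload, two users sending the same model and messages get different keys, while all requests with user_id=None map to the same key.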

Cache strategies

exact matches only when the model and messages are identical. This is the default and works well for tab completions, classification tasks, and any workload with repeated inputs.

semantic uses embedding similarity to match "close enough" requests. This catches requests that mean the same thing but differ in whitespace, punctuation, or minor wording. It yields a higher hit rate at the cost of slightly more overhead per request.
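
To make "embedding similarity" concrete, here is a minimal sketch of a similarity-based lookup. It assumes cosine similarity and a fixed threshold of 0.95; the source does not say which metric or threshold Stockyard uses, and semantic_lookup is a hypothetical name. The vectors are toy values standing in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(query_vec, entries, threshold=0.95):
    """Return the response of the most similar cached entry, or None.

    entries is a list of (embedding, response) pairs. The threshold
    value is an assumption chosen for illustration.
    """
    best_score, best_response = threshold, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score >= best_score:
            best_score, best_response = score, response
    return best_response
```

A linear scan like this is only practical for small caches. It also shows where the extra overhead comes from: every lookup needs an embedding for the incoming request plus a similarity search.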

Response headers

Cached responses include X-Stockyard-Cache: HIT in the response headers. Cache misses include X-Stockyard-Cache: MISS. Your application can use these to track cache effectiveness.
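
One way an application might track cache effectiveness from these headers is sketched below. CacheStats is a hypothetical helper, not part of Stockyard. It takes any mapping of response headers, such as the one your HTTP client returns.

```python
from collections import Counter


class CacheStats:
    """Tally X-Stockyard-Cache header values across responses."""

    def __init__(self):
        self.counts = Counter()

    def record(self, headers):
        # HTTP header names are case-insensitive, so compare in lowercase.
        for name, value in headers.items():
            if name.lower() == "x-stockyard-cache":
                status = value.strip().upper()
                if status in ("HIT", "MISS"):
                    self.counts[status] += 1

    def hit_rate(self):
        total = self.counts["HIT"] + self.counts["MISS"]
        return self.counts["HIT"] / total if total else 0.0
```

Responses that lack the header are ignored rather than counted as misses, so the rate reflects only requests that reported a cache status.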

Cache bypass

To bypass the cache for a specific request, add the header X-Stockyard-No-Cache: true. The request is sent to the provider even if a cached response exists, and the fresh response is still cached for future requests.
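
As a sketch, a client could set the bypass header per request like this. The example only builds the request object and does not send it. build_request is a hypothetical helper, and the URL in the usage note is an assumption about where the proxy accepts chat requests.

```python
import json
import urllib.request


def build_request(url, payload, bypass_cache=False):
    """Build a JSON POST request, optionally asking the proxy to skip the cache."""
    headers = {"Content-Type": "application/json"}
    if bypass_cache:
        headers["X-Stockyard-No-Cache"] = "true"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

For example, build_request("http://localhost:4200/v1/chat/completions", {"model": "...", "messages": [...]}, bypass_cache=True) returns a request that can be passed to urllib.request.urlopen.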

Embedding cache

Embedding requests (/v1/embeddings) have their own cache keyed by content hash. Enable it separately:

embedcache:
  enabled: true
  max_entries: 50000
  ttl: 24h

Monitoring cache performance

Check cache hit rates via the API:

curl http://localhost:4200/api/proxy/modules/cache
# Returns hit/miss counts, hit rate %, and memory usage