Chute
Proxy & middleware. 76 modules, 16 providers, runtime toggles.
Overview
The proxy is the gateway layer. Every LLM request passes through a middleware chain of 76 toggleable modules. The proxy is OpenAI API-compatible — any SDK that talks to /v1/chat/completions works out of the box.
If you only need the proxy layer without tracing or audit, see proxy-only mode.
Modules
Modules are organized by category:
| Category | Modules | Purpose |
|---|---|---|
| routing | fallbackrouter, modelswitch, regionroute, localsync, abrouter | Provider failover, model aliasing, geo routing |
| caching | cachelayer, embedcache, semanticcache | LLM cache for responses and embeddings |
| cost | costcap, tierdrop, idlekill, outputcap, usagepulse, rateshield | Spending limits, LLM rate limiting, usage reporting |
| safety | promptguard, toxicfilter, guardrail, agegate, hallucicheck, secretscan, agentguard | Content moderation, injection detection, PII |
| transform | promptslim, tokentrim, contextpack, chatmem, langbridge, voicebridge | Prompt compression, context management |
| validate | structuredshield, evalgate, codefence | JSON validation, quality gating |
| shims | anthrofit, geminishim | Use Claude/Gemini with OpenAI SDK |
| observe | llmtap, tracelink, alertpulse, driftwatch | Logging, tracing, alerting, drift detection |
Every module is wrapped with toggle.Wrap and checks enabled state on every request. Disable a module and it’s bypassed instantly.
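The wrap-and-check pattern can be sketched in a few lines. This is an illustrative model of the behavior described above, not Stockyard's actual internals; `Module`, `build_chain`, and the handler signatures are all hypothetical names:

```python
# Minimal sketch of per-request module toggling (illustrative names only).
class Module:
    def __init__(self, name, handler):
        self.name = name
        self.handler = handler
        self.enabled = True  # runtime-toggleable flag, checked on every request

    def __call__(self, request, next_handler):
        # Disabled modules are bypassed: the request flows straight through.
        if not self.enabled:
            return next_handler(request)
        return self.handler(request, next_handler)

def build_chain(modules, terminal):
    # Fold the module list into a single callable, outermost module first.
    handler = terminal
    for module in reversed(modules):
        handler = (lambda m, nxt: lambda req: m(req, nxt))(module, handler)
    return handler

# Example: a logging module wrapping a terminal handler.
log = []
logger = Module("llmtap", lambda req, nxt: (log.append(req), nxt(req))[1])
chain = build_chain([logger], terminal=lambda req: f"response:{req}")

print(chain("hello"))   # logger enabled: request is recorded
logger.enabled = False
print(chain("again"))   # module bypassed, no rebuild or restart needed
```

Because the flag is read per request rather than at chain-construction time, flipping it takes effect immediately.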
See the interactive module visualization for the full dependency graph.
Providers
Stockyard supports 16 LLM providers out of the box. Set an environment variable and the provider is auto-configured on boot:
| Provider | Env Var | Models |
|---|---|---|
| OpenAI | OPENAI_API_KEY | gpt-4o, gpt-4.1, o3-mini, etc. |
| Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-5, claude-haiku-4-5 |
| Google Gemini | GEMINI_API_KEY | gemini-2.5-pro, gemini-2.0-flash |
| Groq | GROQ_API_KEY | llama-3.3-70b, mixtral-8x7b |
| Mistral | MISTRAL_API_KEY | mistral-large, codestral |
| DeepSeek | DEEPSEEK_API_KEY | deepseek-chat, deepseek-reasoner |
| Together | TOGETHER_API_KEY | Llama 3.1, Qwen 2.5 |
| Fireworks | FIREWORKS_API_KEY | Llama 3.3, Qwen 2.5 |
| Perplexity | PERPLEXITY_API_KEY | sonar-pro, sonar |
| xAI | XAI_API_KEY | grok-3, grok-2 |
| Cohere | COHERE_API_KEY | command-r-plus, command-a |
| OpenRouter | OPENROUTER_API_KEY | Any model via OpenRouter |
| Replicate | REPLICATE_API_TOKEN | Any model via Replicate |
| Azure OpenAI | AZURE_OPENAI_API_KEY | Azure-hosted models |
| Ollama | (auto at :11434) | Any local model |
| LM Studio | (auto at :1234) | Any local model |
Any OpenAI-compatible API can be added as a custom provider via user settings or the config file.
Routes
Routes map model patterns to providers. When a request comes in for gpt-4o, the router checks the routes table and sends it to the matching provider.
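The matching step can be sketched as a first-match scan over glob-style patterns. The routes table and matching rule below are illustrative assumptions, not Stockyard's actual table or algorithm:

```python
from fnmatch import fnmatch

# Illustrative routes table: model patterns mapped to providers.
ROUTES = [
    ("gpt-*", "openai"),
    ("claude-*", "anthropic"),
    ("llama-*", "groq"),
]

def resolve_provider(model: str) -> str:
    # First matching pattern wins, in table order.
    for pattern, provider in ROUTES:
        if fnmatch(model, pattern):
            return provider
    raise LookupError(f"no route for model {model!r}")

print(resolve_provider("gpt-4o"))             # openai
print(resolve_provider("claude-sonnet-4-5"))  # anthropic
```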
# List routes
curl http://localhost:4200/api/proxy/routes
Toggling Modules
Enable or disable any module at runtime without restart:
# Enable the response cache
curl -X PUT http://localhost:4200/api/proxy/modules/cachelayer \
  -H "Authorization: Bearer sy_admin_..." \
  -H "Content-Type: application/json" \
  -d '{"enabled": true}'

# Disable toxicity filtering
curl -X PUT http://localhost:4200/api/proxy/modules/toxicfilter \
  -H "Authorization: Bearer sy_admin_..." \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'
Module state changes take effect on the next request. No restart needed.
Module Configuration
Modules accept configuration alongside the enabled flag:
# Configure costcap with a $10/day limit
curl -X PUT http://localhost:4200/api/proxy/modules/costcap \
  -H "Authorization: Bearer sy_admin_..." \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "config": {
      "daily_limit_usd": 10.00,
      "action": "block",
      "notify_at_pct": 80
    }
  }'
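The decision this configuration describes can be sketched as a small pure function. This is an assumed reading of the semantics (block when the daily total would exceed the limit; flag for notification at 80% of it), not the costcap module's actual code:

```python
def costcap_check(spent_today_usd, request_cost_usd,
                  daily_limit_usd=10.00, action="block", notify_at_pct=80):
    """Sketch of a daily spend cap: returns (allowed, notify)."""
    projected = spent_today_usd + request_cost_usd
    notify = projected >= daily_limit_usd * notify_at_pct / 100
    if projected > daily_limit_usd and action == "block":
        return False, notify
    return True, notify

print(costcap_check(7.50, 0.80))   # (True, True)  past 80% of $10
print(costcap_check(9.80, 0.50))   # (False, True) would exceed the cap
```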
Provider Shims
Shim modules translate between the OpenAI API format and provider-native formats. Enable a shim and route requests to any provider using the standard OpenAI SDK:
# Route Claude requests through the anthrofit shim
curl http://localhost:4200/v1/chat/completions \
  -H "Authorization: Bearer sy_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Hello from the OpenAI SDK!"}]
  }'
Available shims: anthrofit, geminishim, groqshim, ollamashim, bedrockshim, azureshim, mistralshim, cohereshim, togethershim, deepseekshim, fireworksshim, replicateshim, perplexityshim.
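To make the translation concrete, here is a deliberately simplified sketch of the kind of reshaping a shim performs for Anthropic, whose Messages API takes the system prompt as a top-level field rather than a `system` role message. Real shims also handle tools, images, and streaming events; this is not the anthrofit module's actual code:

```python
def openai_to_anthropic(payload: dict) -> dict:
    """Simplified sketch: OpenAI chat payload -> Anthropic Messages shape."""
    system_parts = [m["content"] for m in payload["messages"]
                    if m["role"] == "system"]
    out = {
        "model": payload["model"],
        "max_tokens": payload.get("max_tokens", 1024),
        "messages": [m for m in payload["messages"] if m["role"] != "system"],
    }
    if system_parts:
        # Anthropic takes the system prompt as a top-level field.
        out["system"] = "\n".join(system_parts)
    return out

req = {"model": "claude-sonnet-4-5",
       "messages": [{"role": "system", "content": "Be terse."},
                    {"role": "user", "content": "Hello"}]}
print(openai_to_anthropic(req)["system"])  # Be terse.
```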
Failover & Routing
Configure automatic provider failover with the fallbackrouter module:
# stockyard.yaml
modules:
fallbackrouter:
enabled: true
config:
chain:
- provider: openai
model: gpt-4o
- provider: anthropic
model: claude-sonnet-4-5
- provider: groq
model: llama-3.3-70b-versatile
max_retries: 2
If OpenAI fails or is rate-limited, the request automatically falls back to Anthropic, then Groq.
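The fallback walk can be sketched as a loop over the chain. This assumes `max_retries` means additional attempts per hop before moving on, which is an interpretation of the config above rather than confirmed behavior:

```python
def call_with_fallback(chain, send, max_retries=2):
    """Sketch: walk the chain until one provider succeeds,
    retrying each hop up to max_retries extra times."""
    last_error = None
    for hop in chain:
        for attempt in range(max_retries + 1):
            try:
                return send(hop["provider"], hop["model"])
            except Exception as exc:  # rate limit, timeout, 5xx, ...
                last_error = exc
    raise RuntimeError("all providers in the chain failed") from last_error

CHAIN = [
    {"provider": "openai", "model": "gpt-4o"},
    {"provider": "anthropic", "model": "claude-sonnet-4-5"},
]

def flaky_send(provider, model):
    if provider == "openai":
        raise TimeoutError("rate limited")  # simulate an OpenAI outage
    return f"{provider}:{model}"

print(call_with_fallback(CHAIN, flaky_send))  # anthropic:claude-sonnet-4-5
```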
Streaming
Streaming works transparently. Set "stream": true in the request body:
curl http://localhost:4200/v1/chat/completions \
-H "Authorization: Bearer sy_..." \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Tell me a joke"}],
"stream": true
}'
The streamsnap module captures the full streamed response for logging without affecting delivery latency.
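The capture-without-delay idea amounts to tee-ing the stream: each chunk is forwarded to the client as it arrives, and the accumulated response is handed to the logger only after the stream ends. A minimal sketch (illustrative, not streamsnap's implementation):

```python
def capture_stream(chunks, sink):
    """Pass chunks through unchanged while accumulating the full
    response; the sink is called once the stream is exhausted."""
    buffer = []
    for chunk in chunks:
        buffer.append(chunk)
        yield chunk            # client sees every chunk as it arrives
    sink("".join(buffer))      # full response logged after the stream ends

logged = []
stream = capture_stream(iter(["Why did ", "the chicken", "..."]), logged.append)
print("".join(stream))   # Why did the chicken...
print(logged)            # ['Why did the chicken...']
```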
Embeddings
The proxy also handles embedding requests:
curl http://localhost:4200/v1/embeddings \
-H "Authorization: Bearer sy_..." \
-H "Content-Type: application/json" \
-d '{
"model": "text-embedding-3-small",
"input": "Stockyard is an LLM proxy"
}'
The embedcache module caches embedding results for identical inputs, saving both time and cost.
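Caching by identical input boils down to keying on a hash of the (model, input) pair and skipping the upstream call on a hit. A sketch under that assumption (class and method names are hypothetical):

```python
import hashlib

class EmbedCache:
    """Sketch: identical (model, input) pairs hit the cache."""
    def __init__(self):
        self._store = {}
        self.hits = 0

    def _key(self, model, text):
        return hashlib.sha256(f"{model}\x00{text}".encode()).hexdigest()

    def get_or_compute(self, model, text, embed):
        key = self._key(model, text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        vector = embed(text)        # only called on a cache miss
        self._store[key] = vector
        return vector

cache = EmbedCache()
fake_embed = lambda text: [float(len(text))]   # stand-in for a real model
cache.get_or_compute("text-embedding-3-small", "Stockyard is an LLM proxy", fake_embed)
cache.get_or_compute("text-embedding-3-small", "Stockyard is an LLM proxy", fake_embed)
print(cache.hits)  # 1
```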
Inspecting Module State
Query the full module list with current status and config:
curl http://localhost:4200/api/proxy/modules \
-H "Authorization: Bearer sy_admin_..."
{
"modules": [
{"name": "fallbackrouter", "enabled": true, "category": "routing", "order": 1},
{"name": "cachelayer", "enabled": true, "category": "caching", "order": 13},
{"name": "costcap", "enabled": false, "category": "cost", "order": 16}
],
"total": 70,
"enabled": 24
}
Caching Patterns
Stockyard offers three caching layers, each suited to different use cases:
| Module | Match Strategy | Best For |
|---|---|---|
| cachelayer | Exact SHA-256 hash | Identical repeated queries (chatbots, FAQ) |
| embedcache | Embedding vector hash | Embedding deduplication |
| semanticcache | Cosine similarity threshold | Similar but not identical queries |
Configure cache TTL and similarity thresholds per module:
# stockyard.yaml
modules:
cachelayer:
enabled: true
config:
ttl: 3600
max_entries: 10000
semanticcache:
enabled: true
config:
similarity_threshold: 0.92
embedding_model: "text-embedding-3-small"
Safety Module Chain
For production deployments, enable the full safety chain:
# Enable all safety modules at once
for module in promptguard toxicfilter guardrail secretscan agentguard; do
  curl -X PUT http://localhost:4200/api/proxy/modules/$module \
    -H "Authorization: Bearer sy_admin_..." \
    -d '{"enabled": true}'
done
Rate Limiting
The rateshield module enforces per-user rate limits using a token bucket algorithm:
# stockyard.yaml
modules:
  rateshield:
    enabled: true
    config:
      rpm: 60         # Requests per minute
      burst: 10       # Allow burst of 10 above limit
      per: "api_key"  # Rate limit per API key
When rate-limited, the proxy returns HTTP 429 with a Retry-After header indicating when the client can retry.
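A token bucket in this shape refills at the steady `rpm` rate while holding extra capacity for bursts. The sketch below assumes `burst` adds headroom above `rpm` (one plausible reading of the config); it is illustrative, not rateshield's implementation:

```python
import time

class TokenBucket:
    """Sketch: rpm tokens refill per minute, with burst headroom."""
    def __init__(self, rpm=60, burst=10):
        self.capacity = rpm + burst
        self.tokens = float(self.capacity)
        self.refill_per_sec = rpm / 60.0
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # caller would answer 429 with Retry-After

bucket = TokenBucket(rpm=60, burst=10)
results = [bucket.allow() for _ in range(71)]
print(results.count(True))   # 70: capacity drained, the 71st is refused
```

On a refusal, the seconds until the next whole token (`(1 - tokens) / refill_per_sec`) is a natural value for the Retry-After header.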