Chute
Proxy & middleware. 76 modules, 16 providers, runtime toggles.
Overview
The proxy is the gateway layer. Every LLM request passes through a middleware chain of 76 toggleable modules. The proxy is OpenAI API-compatible — any SDK that talks to /v1/chat/completions works out of the box.
If you only need the proxy layer without tracing or audit, see proxy-only mode.
Modules
Modules are organized by category:
| Category | Modules | Purpose |
|---|---|---|
| routing | fallbackrouter, modelswitch, regionroute, localsync, abrouter | Provider failover, model aliasing, geo routing |
| caching | cachelayer, embedcache, semanticcache | LLM cache for responses and embeddings |
| cost | costcap, tierdrop, idlekill, outputcap, usagepulse, rateshield | Spending limits, LLM rate limiting, usage reporting |
| safety | promptguard, toxicfilter, guardrail, agegate, hallucicheck, secretscan, agentguard | Content moderation, injection detection, PII |
| transform | promptslim, tokentrim, contextpack, chatmem, langbridge, voicebridge | Prompt compression, context management |
| validate | structuredshield, evalgate, codefence | JSON validation, quality gating |
| shims | anthrofit, geminishim | Use Claude/Gemini with OpenAI SDK |
| observe | llmtap, tracelink, alertpulse, driftwatch | Logging, tracing, alerting, drift detection |
Every module is wrapped with toggle.Wrap and checks enabled state on every request. Disable a module and it’s bypassed instantly.
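The wrap-and-check pattern can be sketched in a few lines. This is an illustrative model of the behavior described above, not Stockyard's actual internals; `Module`, `build_chain`, and the handler signatures are all hypothetical names:

```python
# Minimal sketch of per-request module toggling (illustrative names only).
class Module:
    def __init__(self, name, handler):
        self.name = name
        self.handler = handler
        self.enabled = True  # runtime-toggleable flag, checked on every request

    def __call__(self, request, next_handler):
        # Disabled modules are bypassed: the request flows straight through.
        if not self.enabled:
            return next_handler(request)
        return self.handler(request, next_handler)

def build_chain(modules, terminal):
    # Fold the module list into a single callable, outermost module first.
    handler = terminal
    for module in reversed(modules):
        handler = (lambda m, nxt: lambda req: m(req, nxt))(module, handler)
    return handler

# Example: a logging module wrapping a terminal handler.
log = []
logger = Module("llmtap", lambda req, nxt: (log.append(req), nxt(req))[1])
chain = build_chain([logger], terminal=lambda req: f"response:{req}")

print(chain("hello"))   # logger enabled: request is recorded
logger.enabled = False
print(chain("again"))   # module bypassed, no rebuild or restart needed
```

Because the flag is read per request rather than at chain-construction time, flipping it takes effect immediately.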
See the interactive module visualization for the full dependency graph.
Providers
Stockyard supports 16 LLM providers out of the box. Set an environment variable and the provider is auto-configured on boot:
| Provider | Env Var | Models |
|---|---|---|
| OpenAI | OPENAI_API_KEY | gpt-4o, gpt-4.1, o3-mini, etc. |
| Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-5, claude-haiku-4-5 |
| Google Gemini | GEMINI_API_KEY | gemini-2.5-pro, gemini-2.0-flash |
| Groq | GROQ_API_KEY | llama-3.3-70b, mixtral-8x7b |
| Mistral | MISTRAL_API_KEY | mistral-large, codestral |
| DeepSeek | DEEPSEEK_API_KEY | deepseek-chat, deepseek-reasoner |
| Together | TOGETHER_API_KEY | Llama 3.1, Qwen 2.5 |
| Fireworks | FIREWORKS_API_KEY | Llama 3.3, Qwen 2.5 |
| Perplexity | PERPLEXITY_API_KEY | sonar-pro, sonar |
| xAI | XAI_API_KEY | grok-3, grok-2 |
| Cohere | COHERE_API_KEY | command-r-plus, command-a |
| OpenRouter | OPENROUTER_API_KEY | Any model via OpenRouter |
| Replicate | REPLICATE_API_TOKEN | Any model via Replicate |
| Azure OpenAI | AZURE_OPENAI_API_KEY | Azure-hosted models |
| Ollama | (auto at :11434) | Any local model |
| LM Studio | (auto at :1234) | Any local model |
Any OpenAI-compatible API can be added as a custom provider via user settings or the config file.
Routes
Routes map model patterns to providers. When a request comes in for gpt-4o, the router checks the routes table and sends it to the matching provider.
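The matching step can be sketched as a first-match scan over glob-style patterns. The routes table and matching rule below are illustrative assumptions, not Stockyard's actual table or algorithm:

```python
from fnmatch import fnmatch

# Illustrative routes table: model patterns mapped to providers.
ROUTES = [
    ("gpt-*", "openai"),
    ("claude-*", "anthropic"),
    ("llama-*", "groq"),
]

def resolve_provider(model: str) -> str:
    # First matching pattern wins, in table order.
    for pattern, provider in ROUTES:
        if fnmatch(model, pattern):
            return provider
    raise LookupError(f"no route for model {model!r}")

print(resolve_provider("gpt-4o"))             # openai
print(resolve_provider("claude-sonnet-4-5"))  # anthropic
```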
# List routes
curl http://localhost:4200/api/proxy/routes
Toggling Modules
Enable or disable any module at runtime without restart:
# Enable the response cache
curl -X PUT http://localhost:4200/api/proxy/modules/cachelayer \
  -H "Authorization: Bearer sy_admin_..." \
  -H "Content-Type: application/json" \
  -d '{"enabled": true}'

# Disable toxicity filtering
curl -X PUT http://localhost:4200/api/proxy/modules/toxicfilter \
  -H "Authorization: Bearer sy_admin_..." \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'
Module state changes take effect on the next request. No restart needed.
Module Configuration
Modules accept configuration alongside the enabled flag:
# Configure costcap with a $10/day limit
curl -X PUT http://localhost:4200/api/proxy/modules/costcap \
  -H "Authorization: Bearer sy_admin_..." \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "config": {
      "daily_limit_usd": 10.00,
      "action": "block",
      "notify_at_pct": 80
    }
  }'
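The decision this configuration describes can be sketched as a small pure function. This is an assumed reading of the semantics (block when the daily total would exceed the limit; flag for notification at 80% of it), not the costcap module's actual code:

```python
def costcap_check(spent_today_usd, request_cost_usd,
                  daily_limit_usd=10.00, action="block", notify_at_pct=80):
    """Sketch of a daily spend cap: returns (allowed, notify)."""
    projected = spent_today_usd + request_cost_usd
    notify = projected >= daily_limit_usd * notify_at_pct / 100
    if projected > daily_limit_usd and action == "block":
        return False, notify
    return True, notify

print(costcap_check(7.50, 0.80))   # (True, True)  past 80% of $10
print(costcap_check(9.80, 0.50))   # (False, True) would exceed the cap
```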
Provider Shims
Shim modules translate between the OpenAI API format and provider-native formats. Enable a shim and route requests to any provider using the standard OpenAI SDK:
# Route Claude requests through the anthrofit shim
curl http://localhost:4200/v1/chat/completions \
  -H "Authorization: Bearer sy_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "messages": [{"role": "user", "content": "Hello from the OpenAI SDK!"}]
  }'
Available shims: anthrofit, geminishim, groqshim, ollamashim, bedrockshim, azureshim, mistralshim, cohereshim, togethershim, deepseekshim, fireworksshim, replicateshim, perplexityshim.
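To make the translation concrete, here is a deliberately simplified sketch of the kind of reshaping a shim performs for Anthropic, whose Messages API takes the system prompt as a top-level field rather than a `system` role message. Real shims also handle tools, images, and streaming events; this is not the anthrofit module's actual code:

```python
def openai_to_anthropic(payload: dict) -> dict:
    """Simplified sketch: OpenAI chat payload -> Anthropic Messages shape."""
    system_parts = [m["content"] for m in payload["messages"]
                    if m["role"] == "system"]
    out = {
        "model": payload["model"],
        "max_tokens": payload.get("max_tokens", 1024),
        "messages": [m for m in payload["messages"] if m["role"] != "system"],
    }
    if system_parts:
        # Anthropic takes the system prompt as a top-level field.
        out["system"] = "\n".join(system_parts)
    return out

req = {"model": "claude-sonnet-4-5",
       "messages": [{"role": "system", "content": "Be terse."},
                    {"role": "user", "content": "Hello"}]}
print(openai_to_anthropic(req)["system"])  # Be terse.
```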
Failover & Routing
Configure automatic provider failover with the fallbackrouter module:
# stockyard.yaml
modules:
fallbackrouter:
enabled: true
config:
chain:
- provider: openai
model: gpt-4o
- provider: anthropic
model: claude-sonnet-4-5
- provider: groq
model: llama-3.3-70b-versatile
max_retries: 2
If OpenAI fails or is rate-limited, the request automatically falls back to Anthropic, then Groq.
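The fallback walk can be sketched as a loop over the chain. This assumes `max_retries` means additional attempts per hop before moving on, which is an interpretation of the config above rather than confirmed behavior:

```python
def call_with_fallback(chain, send, max_retries=2):
    """Sketch: walk the chain until one provider succeeds,
    retrying each hop up to max_retries extra times."""
    last_error = None
    for hop in chain:
        for attempt in range(max_retries + 1):
            try:
                return send(hop["provider"], hop["model"])
            except Exception as exc:  # rate limit, timeout, 5xx, ...
                last_error = exc
    raise RuntimeError("all providers in the chain failed") from last_error

CHAIN = [
    {"provider": "openai", "model": "gpt-4o"},
    {"provider": "anthropic", "model": "claude-sonnet-4-5"},
]

def flaky_send(provider, model):
    if provider == "openai":
        raise TimeoutError("rate limited")  # simulate an OpenAI outage
    return f"{provider}:{model}"

print(call_with_fallback(CHAIN, flaky_send))  # anthropic:claude-sonnet-4-5
```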
Streaming
Streaming works transparently. Set "stream": true in the request body:
curl http://localhost:4200/v1/chat/completions \
-H "Authorization: Bearer sy_..." \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Tell me a joke"}],
"stream": true
}'
The streamsnap module captures the full streamed response for logging without affecting delivery latency.
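The capture-without-delay idea amounts to tee-ing the stream: each chunk is forwarded to the client as it arrives, and the accumulated response is handed to the logger only after the stream ends. A minimal sketch (illustrative, not streamsnap's implementation):

```python
def capture_stream(chunks, sink):
    """Pass chunks through unchanged while accumulating the full
    response; the sink is called once the stream is exhausted."""
    buffer = []
    for chunk in chunks:
        buffer.append(chunk)
        yield chunk            # client sees every chunk as it arrives
    sink("".join(buffer))      # full response logged after the stream ends

logged = []
stream = capture_stream(iter(["Why did ", "the chicken", "..."]), logged.append)
print("".join(stream))   # Why did the chicken...
print(logged)            # ['Why did the chicken...']
```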
Embeddings
The proxy also handles embedding requests:
curl http://localhost:4200/v1/embeddings \
-H "Authorization: Bearer sy_..." \
-H "Content-Type: application/json" \
-d '{
"model": "text-embedding-3-small",
"input": "Stockyard is an LLM proxy"
}'
The embedcache module caches embedding results for identical inputs, saving both time and cost.
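Caching by identical input boils down to keying on a hash of the (model, input) pair and skipping the upstream call on a hit. A sketch under that assumption (class and method names are hypothetical):

```python
import hashlib

class EmbedCache:
    """Sketch: identical (model, input) pairs hit the cache."""
    def __init__(self):
        self._store = {}
        self.hits = 0

    def _key(self, model, text):
        return hashlib.sha256(f"{model}\x00{text}".encode()).hexdigest()

    def get_or_compute(self, model, text, embed):
        key = self._key(model, text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        vector = embed(text)        # only called on a cache miss
        self._store[key] = vector
        return vector

cache = EmbedCache()
fake_embed = lambda text: [float(len(text))]   # stand-in for a real model
cache.get_or_compute("text-embedding-3-small", "Stockyard is an LLM proxy", fake_embed)
cache.get_or_compute("text-embedding-3-small", "Stockyard is an LLM proxy", fake_embed)
print(cache.hits)  # 1
```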
Inspecting Module State
Query the full module list with current status and config:
curl http://localhost:4200/api/proxy/modules \
-H "Authorization: Bearer sy_admin_..."
{
"modules": [
{"name": "fallbackrouter", "enabled": true, "category": "routing", "order": 1},
{"name": "cachelayer", "enabled": true, "category": "caching", "order": 13},
{"name": "costcap", "enabled": false, "category": "cost", "order": 16}
],
"total": 70,
"enabled": 24
}
Caching Patterns
Stockyard offers three caching layers, each suited to different use cases:
| Module | Match Strategy | Best For |
|---|---|---|
| cachelayer | Exact SHA-256 hash | Identical repeated queries (chatbots, FAQ) |
| embedcache | Embedding vector hash | Embedding deduplication |
| semanticcache | Cosine similarity threshold | Similar but not identical queries |
Configure cache TTL and similarity thresholds per module:
# stockyard.yaml
modules:
cachelayer:
enabled: true
config:
ttl: 3600
max_entries: 10000
semanticcache:
enabled: true
config:
similarity_threshold: 0.92
embedding_model: "text-embedding-3-small"
Safety Module Chain
For production deployments, enable the full safety chain:
# Enable all safety modules at once
for module in promptguard toxicfilter guardrail secretscan agentguard; do
  curl -X PUT http://localhost:4200/api/proxy/modules/$module \
    -H "Authorization: Bearer sy_admin_..." \
    -d '{"enabled": true}'
done
Rate Limiting
The rateshield module enforces per-user rate limits using a token bucket algorithm:
# stockyard.yaml
modules:
  rateshield:
    enabled: true
    config:
      rpm: 60         # Requests per minute
      burst: 10       # Allow burst of 10 above limit
      per: "api_key"  # Rate limit per API key
When rate-limited, the proxy returns HTTP 429 with a Retry-After header indicating when the client can retry.
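A token bucket in this shape refills at the steady `rpm` rate while holding extra capacity for bursts. The sketch below assumes `burst` adds headroom above `rpm` (one plausible reading of the config); it is illustrative, not rateshield's implementation:

```python
import time

class TokenBucket:
    """Sketch: rpm tokens refill per minute, with burst headroom."""
    def __init__(self, rpm=60, burst=10):
        self.capacity = rpm + burst
        self.tokens = float(self.capacity)
        self.refill_per_sec = rpm / 60.0
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # caller would answer 429 with Retry-After

bucket = TokenBucket(rpm=60, burst=10)
results = [bucket.allow() for _ in range(71)]
print(results.count(True))   # 70: capacity drained, the 71st is refused
```

On a refusal, the seconds until the next whole token (`(1 - tokens) / refill_per_sec`) is a natural value for the Retry-After header.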