# 76 Modules, 400 Nanoseconds: Benchmarking an LLM Middleware Chain
I put 76 middleware modules between my application and OpenAI. PII redaction, cost caps, caching, rate limiting, prompt injection detection, audit logging, content filtering, circuit breaking, billing metering — the full production stack. Then I measured the overhead.
This isn’t a synthetic micro-benchmark on a toy module. This is the actual chain Stockyard runs in production: 76 real modules with real logic, executing on every single request. Here’s how it works, and why the numbers look the way they do.
## The Test Setup
All benchmarks were run on a Xeon Platinum 8581C (Sapphire Rapids), 64 cores, with Go 1.22 and `CGO_ENABLED=0`. The binary is compiled against a pure-Go SQLite driver: no cgo, no external C libraries.
We measured three things separately:
| Benchmark | What It Measures | Result |
|---|---|---|
| Full chain execution | All 76 modules firing in sequence | 400ns |
| Toggle-aware chain | Chain with per-module enable/disable checks | 1.56μs |
| Registry lookup | Finding a module by name in the registry | 23.1ns |
The gap between 400ns and 1.56μs is interesting. The toggle-aware chain checks a boolean flag for each module before executing it. That check, plus the occasional branch misprediction, costs roughly 15ns per module × 76 ≈ 1.16μs of additional overhead. The raw chain skips the check entirely.
In practice you use the toggle-aware chain because you want to enable and disable modules at runtime. 1.56 microseconds is still nothing compared to a 1-second LLM call.
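To reproduce this kind of comparison yourself, here is a minimal sketch of the two chain variants (the names `compose`, `toggled`, and `enabled` are illustrative, not Stockyard's source). Wrap each composed chain in a `testing.B` loop with `go test -bench` to measure the per-module toggle cost:

```go
package main

import "fmt"

// Handler is the signature every module ultimately wraps.
type Handler func(req string) string

// enabled is a flat toggle array, one flag per module.
var enabled [76]bool

// compose builds the raw chain: no toggle checks, just nested closures.
func compose(modules []func(Handler) Handler, final Handler) Handler {
	h := final
	for i := len(modules) - 1; i >= 0; i-- {
		h = modules[i](h)
	}
	return h
}

// toggled wraps a module so it only runs when its flag is set;
// when disabled, control skips straight to the next handler.
func toggled(idx int, m func(Handler) Handler) func(Handler) Handler {
	return func(next Handler) Handler {
		wrapped := m(next)
		return func(req string) string {
			if !enabled[idx] {
				return next(req)
			}
			return wrapped(req)
		}
	}
}

func main() {
	passthrough := func(next Handler) Handler {
		return func(req string) string { return next(req) }
	}
	raw := make([]func(Handler) Handler, 76)
	tog := make([]func(Handler) Handler, 76)
	for i := range raw {
		raw[i] = passthrough
		tog[i] = toggled(i, passthrough)
		enabled[i] = true
	}
	final := func(req string) string { return req }
	fmt.Println(compose(raw, final)("ok")) // raw chain
	fmt.Println(compose(tog, final)("ok")) // toggle-aware chain
}
```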
## What 76 Modules Actually Do
These aren’t stubs. Each module runs real logic on every request or response. Here’s the breakdown by category:
| Category | Count | Examples |
|---|---|---|
| Safety | 10 | firewall, prompt guard, toxicity filter, hallucination check, secret scanner, agent guard |
| General | 9 | PII redactor, content filter, prompt injection detection, cache, retry, rate limiter |
| Cost control | 7 | spend tracker, tier drop, hard caps, idle kill, output cap, usage pulse, cost warning |
| Observability | 7 | response headers, auto-tagger, LLM tap, structured logging, trace linking, alert pulse, drift watch |
| Transform | 7 | prompt slimming, token trimming, context packing, chat memory injection, language bridge |
| Routing | 6 | failover, model switching, region routing, A/B routing, smart route |
| Tack Room | 3 | prompt pad (template injection), prompt linting, approval gate |
| Other | 17 | billing meter, embed cache, tenant isolation, IP fencing, compliance log, shims, validators |
Every module follows the same interface: it receives the request context, can modify it, can short-circuit the chain, and passes control to the next module. The chain executes synchronously — no goroutines per module, no channels, no async coordination.
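A toy sketch of that contract (illustrative types and module names, not Stockyard's actual code) shows both behaviors, mutation and short-circuiting:

```go
package main

import (
	"errors"
	"fmt"
)

// Request carries the mutable per-request state handed to each module.
type Request struct {
	Prompt string
	Tenant string
}

// Handler processes a request and returns the (possibly modified) response.
type Handler func(r *Request) (string, error)

// rateLimit is a toy module: it blocks one hard-coded tenant,
// short-circuiting the chain, and otherwise passes control through.
func rateLimit(next Handler) Handler {
	return func(r *Request) (string, error) {
		if r.Tenant == "blocked" {
			return "", errors.New("rate limit exceeded")
		}
		return next(r)
	}
}

// redact is a toy module that mutates the request before passing it on.
func redact(next Handler) Handler {
	return func(r *Request) (string, error) {
		if r.Prompt == "secret" {
			r.Prompt = "[REDACTED]"
		}
		return next(r)
	}
}

func main() {
	final := func(r *Request) (string, error) { return "echo: " + r.Prompt, nil }
	chain := rateLimit(redact(final))
	out, _ := chain(&Request{Prompt: "secret", Tenant: "ok"})
	fmt.Println(out) // echo: [REDACTED]
}
```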
## Why It’s Fast
Three architecture decisions make this possible.
### 1. Function Composition, Not Interfaces
Each middleware is a `func(next Handler) Handler`. The chain is built once at startup by composing all 76 functions into a single nested closure. At runtime, calling the chain is a sequence of function-pointer dereferences: no virtual dispatch, no interface lookups, no reflection.
```go
// The chain is composed once at boot. After that, it's just function calls.
chain := handler
for i := len(modules) - 1; i >= 0; i-- {
	chain = modules[i](chain)
}
```
Go’s compiler can inline small closures. When a module’s hot path is a few comparisons and a function call, the overhead per module approaches a single branch instruction.
### 2. Zero-Allocation Toggle Registry
Every module has a toggle: enabled or disabled, changeable at runtime via the API. The registry stores these as a flat array of booleans, not a map. Checking if module N is enabled is an array index, not a hash lookup.
```go
// Toggle check is an array index: O(1), no allocation, no hash.
if !registry.enabled[moduleIndex] {
	return next.Handle(ctx, req) // skip this module
}
```
The toggle array fits in a single cache line for up to 64 modules. At 76 modules we spill into a second line, but the CPU prefetcher handles this predictably. This is why the registry lookup benchmarks at 23.1ns — it’s a single cache-line read.
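One way to sketch that registry in a race-safe form (illustrative, not Stockyard's source) is with `sync/atomic`, so API-driven toggling never needs a mutex on the read path. Note that `atomic.Bool` is four bytes rather than one, so the exact cache-line arithmetic differs from a plain byte array:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Registry keeps one flag per module in a flat array. Atomic loads and
// stores make runtime toggling safe without locking the hot path.
type Registry struct {
	enabled [76]atomic.Bool
}

// Enabled is the hot-path check: a single indexed load, no allocation.
func (r *Registry) Enabled(idx int) bool {
	return r.enabled[idx].Load()
}

// Set flips a module at runtime (e.g. from an admin API handler).
func (r *Registry) Set(idx int, on bool) {
	r.enabled[idx].Store(on)
}

func main() {
	var reg Registry
	reg.Set(42, true)
	fmt.Println(reg.Enabled(42), reg.Enabled(0)) // true false
}
```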
### 3. No Allocations in the Hot Path
The request context is passed by pointer. Modules that need to attach metadata use `context.WithValue`, which allocates, but we minimize this by batching metadata into a single struct allocated once per request. Modules that don’t modify the context pass the pointer through unchanged.
We verified this with `go test -benchmem`: the full 76-module chain allocates 0 bytes per operation when all modules are enabled but no module triggers a side effect (the “passthrough” case). Real requests do allocate — PII redaction needs to copy and modify strings, caching needs to serialize — but the chain overhead itself is allocation-free.
## Real-World Numbers
Benchmarks are one thing. Production traffic is another. We’ve processed over 1,000 real requests through the full chain on the live deployment. Here’s what the observability data shows:
- $0.84 total LLM cost tracked
- ~1,200ms average request latency (dominated by OpenAI response time)
- <0.002ms middleware overhead (below measurement threshold in traces)
The middleware overhead doesn’t even register in production traces because it sits below the 1ms resolution of the trace timer; it disappears into the noise. The entire request lifecycle is dominated by the LLM provider’s response time, typically 500ms to 3 seconds depending on the model and prompt length.
## What This Means Practically
If you’re building on LLMs, you need most of these modules. Cost caps alone save money. Caching reduces latency by 10-100x for repeated queries. PII redaction is a legal requirement in many jurisdictions. Rate limiting prevents runaway bills.
The traditional approach is to add each concern as a separate service: a proxy here, an observability tool there, a billing webhook somewhere else. Each hop adds 1-10ms of network latency. Four separate services in the request path add 4-40ms — orders of magnitude more than running all 76 modules in-process.
That’s the core argument for a monolithic middleware chain: 76 modules in-process at 400ns beats 4 services over the network at 4-40ms. You get more functionality with less latency.
## The Tradeoffs
This architecture isn’t free. There are real tradeoffs.
Single process means single point of failure. If the proxy crashes, everything goes down. We mitigate this with Railway’s auto-restart and health checks, but it’s not the same as a distributed mesh of independent services.
SQLite limits write concurrency. With WAL mode, you get unlimited concurrent readers but only one writer. For a proxy that’s mostly reads (config lookups, cache checks) this works well. But high-volume trace writing can become a bottleneck. We use rollup tables to batch writes.
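A rollup-style batcher can be sketched in a few lines (illustrative, not Stockyard's implementation): rows accumulate in memory and the single SQLite writer is hit once per batch, inside one transaction, instead of once per trace.

```go
package main

import "fmt"

// traceBatcher collects trace rows in memory and flushes them in one
// write transaction, keeping SQLite's single writer off the hot path.
type traceBatcher struct {
	buf   []string
	limit int
	flush func(rows []string) // e.g. a multi-row INSERT inside one tx
}

// Add buffers a row and flushes when the batch is full.
func (b *traceBatcher) Add(row string) {
	b.buf = append(b.buf, row)
	if len(b.buf) >= b.limit {
		b.Flush()
	}
}

// Flush drains whatever is buffered; call it on a ticker or at shutdown.
func (b *traceBatcher) Flush() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf = b.buf[:0]
}

func main() {
	batches := 0
	b := &traceBatcher{limit: 100, flush: func(rows []string) { batches++ }}
	for i := 0; i < 250; i++ {
		b.Add(fmt.Sprintf("trace-%d", i))
	}
	b.Flush() // drain the tail
	fmt.Println(batches) // 3
}
```

A real version would also flush on a timer so a quiet deployment doesn’t hold traces in memory indefinitely.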
Monolithic binary means monolithic deploys. Updating one module means redeploying the entire binary. In practice this takes 30 seconds on Railway, so it’s not painful, but it’s different from updating a single microservice.
For a solo developer or small team running up to tens of thousands of requests per minute, these tradeoffs are overwhelmingly favorable. You don’t need Kubernetes for an LLM proxy.
## Try It
Stockyard's proxy core is open source (Apache 2.0). The full platform — all 76 modules, 150 tools, all 16 provider integrations — has a generous free tier.
```sh
curl -sSL https://stockyard.dev/install.sh | sh
stockyard
# Running on http://localhost:4200
# Proxy: http://localhost:4200/v1/chat/completions
```
Point your OpenAI client at http://localhost:4200/v1 and every request flows through the full middleware chain. Toggle modules on and off at runtime via the API or the web console.
The benchmarks page has more detail: stockyard.dev/benchmarks. The source is on GitHub.