Take any past request, re-run it against a different model, and compare the results. No code changes, no test harness.
You want to switch from GPT-4o to Claude or DeepSeek. But will the output quality hold? The only way to know is to test with your actual production requests, not synthetic benchmarks.
Request replay takes a real request from your logs, sends it to a different model, and shows you the original and new response side by side. You compare cost, latency, token count, and output quality on your own data.
Every request through Stockyard is logged with the full prompt, response, model, tokens, cost, and latency. Lasso (Stockyard's replay engine) lets you pick any logged request and re-run it:
```bash
# Replay a request against a different model
curl -X POST http://localhost:4200/api/replay \
  -d '{"trace_id": "tr_a8f21c4e", "model": "claude-sonnet-4-5-20250929"}'

# Compare the results
curl http://localhost:4200/api/replay/compare/tr_a8f21c4e
# Original: gpt-4o,            $0.0045, 1.2s, 342 tokens
# Replay:   claude-sonnet-4-5, $0.0038, 0.9s, 318 tokens
```
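If you're weighing several candidates at once, the same two endpoints compose into a quick shell loop. A minimal sketch; the model list here is illustrative, and it assumes the compare endpoint reflects the most recent replay for a trace:

```bash
# Sketch: replay one logged request against several candidate models,
# using only the /api/replay and /api/replay/compare endpoints shown above.
TRACE="tr_a8f21c4e"

for MODEL in claude-sonnet-4-5-20250929 deepseek-chat gemini-2.0-flash; do
  # Re-run the logged request against this candidate
  curl -s -X POST http://localhost:4200/api/replay \
    -d "{\"trace_id\": \"$TRACE\", \"model\": \"$MODEL\"}"
  echo
done

# Side-by-side comparison for the trace
curl -s http://localhost:4200/api/replay/compare/$TRACE
```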
Provider migration. Before switching from OpenAI to Anthropic, replay 100 production requests and compare quality. Make the decision with data, not guesswork; a scripted batch pass is sketched after this list.
Cost optimization. Replay your most expensive requests against cheaper models. Find out which requests can safely use DeepSeek or Gemini Flash instead of GPT-4o.
Regression testing. After changing prompts, replay historical requests to verify the new prompt produces equivalent or better output.
Shareable comparisons. Generate a share link for any replay comparison. Send it to your team to discuss whether the switch makes sense.
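The migration and cost-optimization passes above are easy to script with the same two endpoints. A minimal sketch, assuming you've collected the trace IDs you want to test into a file; `trace_ids.txt` and the target model are placeholders, not part of Stockyard itself:

```bash
# Sketch: replay a batch of logged requests against one candidate model.
# trace_ids.txt is a hypothetical file of logged trace IDs, one per line.
TARGET_MODEL="claude-sonnet-4-5-20250929"

while read -r TRACE; do
  # Re-run the logged request against the candidate model
  curl -s -X POST http://localhost:4200/api/replay \
    -d "{\"trace_id\": \"$TRACE\", \"model\": \"$TARGET_MODEL\"}"

  # Print the side-by-side comparison for this trace
  curl -s http://localhost:4200/api/replay/compare/"$TRACE"
  echo
done < trace_ids.txt
```

Skim the comparison lines for cost and latency deltas, then spot-check output quality on the traces where the two models diverge most.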
Public benchmarks test models on standardized tasks. Replay tests models on your actual workload. A model that scores well on MMLU might perform poorly on your specific prompt patterns. Replay gives you the answer for your data.
Try Stockyard. One binary, 16 providers, under 60 seconds.