Take any past request, re-run it against a different model, and compare the results. No code changes, no test harness.
You want to switch from GPT-4o to Claude or DeepSeek. But will the output quality hold? The only way to know is to test with your actual production requests, not synthetic benchmarks.
Request replay takes a real request from your logs, sends it to a different model, and shows you the original and new response side by side. You compare cost, latency, token count, and output quality on your own data.
Every request through Stockyard is logged with the full prompt, response, model, tokens, cost, and latency. Lasso (Stockyard's replay engine) lets you pick any logged request and re-run it:
```bash
# Replay a request against a different model
curl -X POST http://localhost:4200/api/replay \
  -d '{"trace_id": "tr_a8f21c4e", "model": "claude-sonnet-4-5-20250929"}'

# Compare the results
curl http://localhost:4200/api/replay/compare/tr_a8f21c4e
# Original: gpt-4o,            $0.0045, 1.2s, 342 tokens
# Replay:   claude-sonnet-4-5, $0.0038, 0.9s, 318 tokens
```
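If you're weighing several candidates at once, the same two endpoints compose into a quick shell loop. A minimal sketch; the model list here is illustrative, and it assumes the compare endpoint reflects the most recent replay for a trace:

```bash
# Sketch: replay one logged request against several candidate models,
# using only the /api/replay and /api/replay/compare endpoints shown above.
TRACE="tr_a8f21c4e"

for MODEL in claude-sonnet-4-5-20250929 deepseek-chat gemini-2.0-flash; do
  # Re-run the logged request against this candidate
  curl -s -X POST http://localhost:4200/api/replay \
    -d "{\"trace_id\": \"$TRACE\", \"model\": \"$MODEL\"}"
  echo
done

# Side-by-side comparison for the trace
curl -s http://localhost:4200/api/replay/compare/$TRACE
```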
Provider migration. Before switching from OpenAI to Anthropic, replay 100 production requests and compare quality. Make the decision with data, not guesswork; a scripted batch pass is sketched after this list.
Cost optimization. Replay your most expensive requests against cheaper models. Find out which requests can safely use DeepSeek or Gemini Flash instead of GPT-4o.
Regression testing. After changing prompts, replay historical requests to verify the new prompt produces equivalent or better output.
Shareable comparisons. Generate a share link for any replay comparison. Send it to your team to discuss whether the switch makes sense.
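The migration and cost-optimization passes above are easy to script with the same two endpoints. A minimal sketch, assuming you've collected the trace IDs you want to test into a file; `trace_ids.txt` and the target model are placeholders, not part of Stockyard itself:

```bash
# Sketch: replay a batch of logged requests against one candidate model.
# trace_ids.txt is a hypothetical file of logged trace IDs, one per line.
TARGET_MODEL="claude-sonnet-4-5-20250929"

while read -r TRACE; do
  # Re-run the logged request against the candidate model
  curl -s -X POST http://localhost:4200/api/replay \
    -d "{\"trace_id\": \"$TRACE\", \"model\": \"$TARGET_MODEL\"}"

  # Print the side-by-side comparison for this trace
  curl -s http://localhost:4200/api/replay/compare/"$TRACE"
  echo
done < trace_ids.txt
```

Skim the comparison lines for cost and latency deltas, then spot-check output quality on the traces where the two models diverge most.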
Public benchmarks test models on standardized tasks. Replay tests models on your actual workload. A model that scores well on MMLU might perform poorly on your specific prompt patterns. Replay gives you the answer for your data.
Try Stockyard. One binary, 16 providers, under 60 seconds.