
I Made Claude Haiku and Sonnet Race Each Other. The Results Are Surprising.

TTFT, tokens/second, total time — all visible, side by side, for the same prompt. I built a real-time streaming comparison tool because model selection shouldn't be based on benchmarks you can't feel. Here's what I actually learned about the Haiku/Sonnet tradeoff from watching them race.

anthropic · claude · streaming · model selection · sse · engineering

Every time I start a new AI feature, I spend 10 minutes wondering: Haiku or Sonnet? The benchmarks say Sonnet is smarter. The pricing sheet says Haiku is cheaper. But neither tells me what I actually need to know: how much slower does Sonnet feel to the user? I built a tool to make that question answerable in real time.

The model selection problem

When you're building a user-facing AI feature, there are three things that matter about model selection: cost per request, output quality, and perceived latency. Cost is easy to calculate. Quality requires judgment. Latency is where most people get it wrong.

The mistake is treating total response time as the key latency metric. For streaming AI features — which is almost everything user-facing — the number that actually matters is time to first token (TTFT): how long between the user pressing enter and seeing the first character appear.

With streaming, a response that takes 4 seconds total but starts streaming at 180ms feels fast. A response that takes 1 second total but doesn't start until 900ms feels slow. The psychological distinction is real and measurable — users consistently perceive “instant start” responses as faster, even when total generation time is longer.

I knew this intuitively from building streaming features, but I'd never actually measured the difference between Haiku and Sonnet on TTFT side by side. So I built Model Face-Off.

The architecture: parallel streams from the client

The core design decision was running both model streams simultaneously from the client, not sequentially. A naive implementation might run Haiku first, then Sonnet, and compare the results. That introduces sequential bias and makes timing comparisons meaningless — Sonnet would always appear to start faster because Haiku's run had already warmed up the network path.

Instead, the client fires two fetch requests at the exact same time:

// Both streams start simultaneously
await Promise.all([
  runStream("haiku", prompt, setHaikuState, signal),
  runStream("sonnet", prompt, setSonnetState, signal),
]);

Each fetch hits a separate API route that proxies the Anthropic streaming API and emits server-sent events. The timing measurements happen client-side: TTFT is recorded when the first text event arrives; total time when the done event fires.
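The post doesn't show `runStream` itself, so here is a minimal sketch of what it could look like. The route path (`/api/face-off/…`), the SSE framing from the proxy route, and the patch-style `setState` signature are all assumptions, not from the original:

```typescript
// SSE frames are separated by a blank line; the tail of the buffer may be an
// incomplete frame, so hand it back as the new buffer.
function splitSSEEvents(buffer: string): { events: string[]; rest: string } {
  const parts = buffer.split("\n\n");
  return { events: parts.slice(0, -1), rest: parts[parts.length - 1] };
}

// Hypothetical runStream: read SSE from our own API route, recording TTFT
// when the first data frame arrives and total time when the stream ends.
async function runStream(
  model: "haiku" | "sonnet",
  prompt: string,
  setState: (patch: { text?: string; ttftMs?: number; totalMs?: number }) => void,
  signal: AbortSignal,
): Promise<void> {
  const start = performance.now();
  const res = await fetch(`/api/face-off/${model}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
    signal,
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  let sawFirstToken = false;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const { events, rest } = splitSSEEvents(buffer);
    buffer = rest;
    for (const raw of events) {
      const data = raw.split("\n").find((l) => l.startsWith("data: "))?.slice(6);
      if (!data) continue;
      if (!sawFirstToken) {
        sawFirstToken = true;
        setState({ ttftMs: performance.now() - start }); // TTFT, measured client-side
      }
      setState({ text: data });
    }
  }
  setState({ totalMs: performance.now() - start }); // total time at stream end
}
```

The key property is that all timing happens in the browser, so both models are measured from the same clock and the same network vantage point.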

Why raw fetch instead of the SDK

I use the Anthropic Node.js SDK for local development and in Claude Code. But for Vercel serverless functions, I've learned the hard way that the SDK has a streaming compatibility issue that throws a misleading authentication error at runtime, even when the API key is correctly set.

The fix is straightforward: use raw fetch() directly to api.anthropic.com/v1/messages with stream: true. The Anthropic API returns standard SSE format, which you can parse line by line:

const response = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": process.env.ANTHROPIC_API_KEY!,
    "anthropic-version": "2023-06-01",
  },
  body: JSON.stringify({
    model: "claude-haiku-4-5",
    max_tokens: 2048,
    stream: true,
    messages: [{ role: "user", content: prompt }],
  }),
});

The response body is a ReadableStream of SSE events. Each content_block_delta event with type text_delta contains the next text fragment. The message_start event contains input token count; message_delta contains output token count when the stream completes.
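The event handling above can be condensed into a couple of small parsers. The event names and payload shapes match the Anthropic Messages API; the helper names themselves (`extractTextDelta`, `extractUsage`) are mine, not from the post:

```typescript
// Pull the text fragment out of a parsed SSE data payload.
// Returns null for non-text events (message_start, ping, message_delta, ...).
function extractTextDelta(dataLine: string): string | null {
  const event = JSON.parse(dataLine);
  if (event.type === "content_block_delta" && event.delta?.type === "text_delta") {
    return event.delta.text;
  }
  return null;
}

// Token usage is split across events: input tokens arrive up front in
// message_start, output tokens arrive in message_delta near the end.
function extractUsage(dataLine: string): { input?: number; output?: number } {
  const event = JSON.parse(dataLine);
  if (event.type === "message_start") {
    return { input: event.message?.usage?.input_tokens };
  }
  if (event.type === "message_delta") {
    return { output: event.usage?.output_tokens };
  }
  return {};
}
```

Keeping the parsing pure like this makes it trivial to unit test without touching the network.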

One thing worth noting: this pattern works in any runtime that supports fetch and the Streams API — Vercel edge functions, Cloudflare Workers, Node.js 18+, Deno. The SDK adds convenience; raw fetch adds portability.
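For completeness, the proxy route mentioned earlier can be little more than a pass-through, since the Anthropic API already speaks SSE. This is a hedged sketch assuming a Next.js App Router-style handler with a hardcoded model; the actual route isn't shown in the post:

```typescript
// Hypothetical route handler: forward the prompt to Anthropic and pipe the
// upstream SSE bytes straight back to the browser.
export async function POST(req: Request): Promise<Response> {
  const { prompt } = await req.json();
  const upstream = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-api-key": process.env.ANTHROPIC_API_KEY!,
      "anthropic-version": "2023-06-01",
    },
    body: JSON.stringify({
      model: "claude-haiku-4-5",
      max_tokens: 2048,
      stream: true,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  // Response bodies are ReadableStreams, so no re-buffering is needed;
  // the Content-Type header tells the client to treat it as SSE.
  return new Response(upstream.body, {
    headers: { "Content-Type": "text/event-stream" },
  });
}
```

Because the body is streamed through untouched, the proxy adds essentially no latency to TTFT beyond the extra network hop.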

What I actually learned from the results

After running a few hundred prompts through the comparison, some patterns emerged that changed how I think about model selection.

TTFT gap is smaller than I expected. For simple prompts, Haiku's TTFT is roughly 150-250ms vs Sonnet's 350-550ms. That's a real difference, but both feel responsive. The gap widens noticeably for complex prompts where Sonnet spends more time “thinking” before generating.

Tokens-per-second tells a different story. Once Sonnet starts generating, it's often faster token-for-token than Haiku. So Sonnet can start slower but sometimes finish at similar total times for short responses. For long responses, Haiku usually wins total time.

Output quality gap is highly task-dependent. For fact-retrieval, summarization, and extraction tasks, Haiku is close enough to Sonnet that the quality difference doesn't justify the cost increase. For reasoning tasks — debugging, architecture decisions, nuanced writing — Sonnet's outputs are noticeably better. The comparison makes this visible when you run the same prompt on both.

The right default is context-dependent. For chat interfaces where the user is waiting: Haiku for simple queries, Sonnet for anything requiring reasoning. For batch processing where speed matters more than perception: Haiku almost always. For agentic tasks where output quality compounds across steps: Sonnet.

The broader pattern: make tradeoffs visceral

The reason I build comparison and visualization demos isn't just for the portfolio. It's that I've found these tools genuinely change how I think about the systems I'm building.

Before Model Face-Off, my model selection was a guess backed by intuition. After using it for a few sessions, I have a concrete mental model of what the latency gap actually feels like and which task types benefit most from each model. That knowledge transfers to every AI feature I build.

The same pattern applies to the RAG Demo (which visualizes retrieval and generation separately), the Pipeline Demo (which shows each transformation step), and the MCP Server Demo (which makes protocol messages visible). Visibility doesn't just help users understand the system — it helps the builder understand it too.

If you're building AI-native products, consider building your own versions of these tools. Not for the portfolio. For the clarity.