AI Eval Lab
The workflow real AI teams use to ensure prompt reliability — made interactive
Concurrent test evaluation
Keyword-based scoring
SSE streaming suggestions
Pre-populated examples
01 The Problem
Most AI developers write a prompt, test it manually once or twice, ship it, and wonder why outputs are inconsistent in production. Systematic prompt evaluation — defining test cases, measuring pass rates, iterating on failures — is standard practice at serious AI companies but almost invisible in the developer tooling space. The opportunity was to make this workflow interactive and visual.
02 The Approach
Built a two-panel UI: the left panel holds the prompt editor (system prompt plus a user template with {{variable}} placeholders) and the test case builder; the right panel shows results with a score ring, a detailed results table, and a streaming 'Fix My Prompt' feature. Test cases define input variable values and expected output keywords. All test cases run against Claude Haiku concurrently, and each result is scored by keyword match rate (pass = 70%+). When failures occur, Claude analyzes them and streams specific suggestions for improving the prompt.
03 Architecture Decisions
Concurrent test case evaluation
The evaluate API route runs all test cases in parallel using Promise.all, sending each as an independent Claude Haiku call. Results stream back as a JSON array. This means a 5-test evaluation completes in roughly the time of one request, not five sequential requests. Each result includes the full LLM response, matched keywords, missed keywords, and a numeric score.
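The parallel fan-out described above can be sketched as follows. This is a minimal illustration, not the app's actual route code: `runModel` stands in for the real Claude Haiku call, and the type and function names are assumptions.

```typescript
// A test case supplies values for {{variable}} placeholders plus the
// keywords expected in the model's output.
interface TestCase {
  variables: Record<string, string>;
  expectedKeywords: string[];
}

// Substitute {{variable}} placeholders with the test case's values.
function fillTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_m, name) => vars[name] ?? "");
}

// Run every test case as an independent model call. Promise.all fires
// them all at once, so a 5-test evaluation takes roughly one round trip
// rather than five sequential ones.
async function evaluateAll(
  systemPrompt: string,
  userTemplate: string,
  cases: TestCase[],
  runModel: (system: string, user: string) => Promise<string>
): Promise<string[]> {
  return Promise.all(
    cases.map((tc) =>
      runModel(systemPrompt, fillTemplate(userTemplate, tc.variables))
    )
  );
}
```

Injecting the model call as a parameter keeps the fan-out logic testable without touching the network; the real route would pass a thin wrapper around the Anthropic SDK.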
Keyword-based scoring with configurable pass threshold
Scoring is based on keyword presence rather than semantic similarity — deterministic, fast, and transparent. Expected keywords are checked case-insensitively against the LLM response. Score = matched / total expected keywords. Pass threshold is 70%. This approach makes scoring legible: you can see exactly which keywords were found or missed in each result row.
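The scoring rule is simple enough to state in a few lines. A sketch, with illustrative names (the app's actual identifiers may differ):

```typescript
const PASS_THRESHOLD = 0.7; // pass when 70%+ of expected keywords appear

// Case-insensitive keyword check: score = matched / total expected.
function scoreResponse(response: string, expectedKeywords: string[]) {
  const lower = response.toLowerCase();
  const matched = expectedKeywords.filter((k) => lower.includes(k.toLowerCase()));
  const missed = expectedKeywords.filter((k) => !lower.includes(k.toLowerCase()));
  const score =
    expectedKeywords.length > 0 ? matched.length / expectedKeywords.length : 0;
  return { matched, missed, score, pass: score >= PASS_THRESHOLD };
}
```

Because the result carries `matched` and `missed` alongside the score, the UI can show exactly which keywords drove a pass or fail rather than an opaque number.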
SSE streaming for the Fix My Prompt feature
When you click 'Fix My Prompt', failed test cases (inputs, expected outputs, actual outputs) are sent to a Claude Sonnet call that analyzes the pattern of failures and streams improvement suggestions. The fix-prompt route emits proper SSE events (data: {json}\n\n format) so suggestions appear word-by-word rather than all at once. This makes the diagnostic reasoning visible as it's produced.
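The SSE framing itself is just text. A minimal sketch of how the route could wrap streamed model output in `data: {json}\n\n` events, assuming a hypothetical async iterable of text chunks (the actual route's plumbing will differ):

```typescript
// Each SSE event is a "data:" line followed by a blank line.
function sseEvent(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Wrap each streamed model chunk in an SSE event, then signal completion,
// so the client can render suggestions word-by-word as they arrive.
async function* streamSuggestions(chunks: AsyncIterable<string>) {
  for await (const text of chunks) {
    yield sseEvent({ text });
  }
  yield sseEvent({ done: true });
}
```

On the client, an `EventSource` (or a manual `fetch` reader) splits on the blank lines and `JSON.parse`s each `data:` payload.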
Pre-populated examples for immediate usability
The app ships with three working test cases using a summarization prompt: Rayleigh scattering, Paris as capital of France, machine learning patterns. Users can run the default setup immediately to see how the tool works before modifying it. This removes the blank-canvas problem — every demo starts in a usable state.
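The bundled defaults might look something like this. Field names and input texts are illustrative assumptions, not the app's actual schema:

```typescript
// Hypothetical shape of the pre-populated test cases for the default
// summarization prompt: one input variable, a handful of expected keywords.
const defaultTestCases = [
  {
    variables: { text: "The sky appears blue because air molecules scatter short wavelengths of sunlight more than long ones, an effect called Rayleigh scattering." },
    expectedKeywords: ["Rayleigh", "scattering", "blue"],
  },
  {
    variables: { text: "Paris is the capital of France and its largest city." },
    expectedKeywords: ["Paris", "capital", "France"],
  },
  {
    variables: { text: "Machine learning systems find statistical patterns in data rather than following hand-written rules." },
    expectedKeywords: ["patterns", "data"],
  },
];
```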
04 Key Insight
The most sophisticated AI teams don't just 'write good prompts'; they run evaluation suites. This app makes that practice visible and accessible. The 'Fix My Prompt' feature is the key differentiator: it doesn't just show you what failed, it analyzes the failure pattern and tells you what to change in the system prompt to fix it.
05 Why It Matters
Demonstrates evaluation culture — the practice that distinguishes production AI systems from demos. Shows understanding of reliability engineering for AI features: defining expected behavior, measuring it systematically, and iterating on failures. A strong interview signal for any AI engineering role.