AI Eval Lab
The workflow real AI teams use to ensure prompt reliability — made interactive
Concurrent test evaluation
Keyword-based scoring
SSE streaming suggestions
Pre-populated examples
01 The Problem
Most AI developers write a prompt, test it manually once or twice, ship it, and wonder why outputs are inconsistent in production. Systematic prompt evaluation — defining test cases, measuring pass rates, iterating on failures — is standard practice at serious AI companies but almost invisible in the developer tooling space. The opportunity was to make this workflow interactive and visual.
02 The Approach
Built a two-panel UI: the left panel holds the prompt editor (system prompt plus a user template with {{variable}} placeholders) and the test case builder; the right panel shows results with a score ring, a detailed results table, and a streaming 'Fix My Prompt' feature. Test cases define input variable values and expected output keywords. All test cases run against Claude Haiku concurrently, and each result is scored by keyword match rate (pass = 70%+). When failures occur, Claude analyzes them and streams specific suggestions for improving the prompt.
03 Architecture Decisions
Concurrent test case evaluation
The evaluate API route runs all test cases in parallel using Promise.all, sending each as an independent Claude Haiku call. Results stream back as a JSON array. This means a 5-test evaluation completes in roughly the time of one request, not five sequential requests. Each result includes the full LLM response, matched keywords, missed keywords, and a numeric score.
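The parallel fan-out described above can be sketched as follows. This is a minimal illustration, not the app's actual route code: `runModel` stands in for the real Claude Haiku call, and the type and function names are assumptions.

```typescript
// A test case supplies values for {{variable}} placeholders plus the
// keywords expected in the model's output.
interface TestCase {
  variables: Record<string, string>;
  expectedKeywords: string[];
}

// Substitute {{variable}} placeholders with the test case's values.
function fillTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_m, name) => vars[name] ?? "");
}

// Run every test case as an independent model call. Promise.all fires
// them all at once, so a 5-test evaluation takes roughly one round trip
// rather than five sequential ones.
async function evaluateAll(
  systemPrompt: string,
  userTemplate: string,
  cases: TestCase[],
  runModel: (system: string, user: string) => Promise<string>
): Promise<string[]> {
  return Promise.all(
    cases.map((tc) =>
      runModel(systemPrompt, fillTemplate(userTemplate, tc.variables))
    )
  );
}
```

Injecting the model call as a parameter keeps the fan-out logic testable without touching the network; the real route would pass a thin wrapper around the Anthropic SDK.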
Keyword-based scoring with configurable pass threshold
Scoring is based on keyword presence rather than semantic similarity — deterministic, fast, and transparent. Expected keywords are checked case-insensitively against the LLM response. Score = matched / total expected keywords. Pass threshold is 70%. This approach makes scoring legible: you can see exactly which keywords were found or missed in each result row.
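The scoring rule is simple enough to state in a few lines. A sketch, with illustrative names (the app's actual identifiers may differ):

```typescript
const PASS_THRESHOLD = 0.7; // pass when 70%+ of expected keywords appear

// Case-insensitive keyword check: score = matched / total expected.
function scoreResponse(response: string, expectedKeywords: string[]) {
  const lower = response.toLowerCase();
  const matched = expectedKeywords.filter((k) => lower.includes(k.toLowerCase()));
  const missed = expectedKeywords.filter((k) => !lower.includes(k.toLowerCase()));
  const score =
    expectedKeywords.length > 0 ? matched.length / expectedKeywords.length : 0;
  return { matched, missed, score, pass: score >= PASS_THRESHOLD };
}
```

Because the result carries `matched` and `missed` alongside the score, the UI can show exactly which keywords drove a pass or fail rather than an opaque number.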
SSE streaming for the Fix My Prompt feature
When you click 'Fix My Prompt', failed test cases (inputs, expected outputs, actual outputs) are sent to a Claude Sonnet call that analyzes the pattern of failures and streams improvement suggestions. The fix-prompt route emits proper SSE events (data: {json}\n\n format) so suggestions appear word-by-word rather than all at once. This makes the diagnostic reasoning visible as it's produced.
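The SSE framing itself is just text. A minimal sketch of how the route could wrap streamed model output in `data: {json}\n\n` events, assuming a hypothetical async iterable of text chunks (the actual route's plumbing will differ):

```typescript
// Each SSE event is a "data:" line followed by a blank line.
function sseEvent(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Wrap each streamed model chunk in an SSE event, then signal completion,
// so the client can render suggestions word-by-word as they arrive.
async function* streamSuggestions(chunks: AsyncIterable<string>) {
  for await (const text of chunks) {
    yield sseEvent({ text });
  }
  yield sseEvent({ done: true });
}
```

On the client, an `EventSource` (or a manual `fetch` reader) splits on the blank lines and `JSON.parse`s each `data:` payload.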
Pre-populated examples for immediate usability
The app ships with three working test cases using a summarization prompt: Rayleigh scattering, Paris as capital of France, machine learning patterns. Users can run the default setup immediately to see how the tool works before modifying it. This removes the blank-canvas problem — every demo starts in a usable state.
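The bundled defaults might look something like this. Field names and input texts are illustrative assumptions, not the app's actual schema:

```typescript
// Hypothetical shape of the pre-populated test cases for the default
// summarization prompt: one input variable, a handful of expected keywords.
const defaultTestCases = [
  {
    variables: { text: "The sky appears blue because air molecules scatter short wavelengths of sunlight more than long ones, an effect called Rayleigh scattering." },
    expectedKeywords: ["Rayleigh", "scattering", "blue"],
  },
  {
    variables: { text: "Paris is the capital of France and its largest city." },
    expectedKeywords: ["Paris", "capital", "France"],
  },
  {
    variables: { text: "Machine learning systems find statistical patterns in data rather than following hand-written rules." },
    expectedKeywords: ["patterns", "data"],
  },
];
```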
04 Key Insight
The most sophisticated AI teams don't just 'write good prompts'; they run evaluation suites. This app makes that practice visible and accessible. The 'Fix My Prompt' feature is the key differentiator: it doesn't just show you what failed, it analyzes the failure pattern and tells you what to change in the system prompt to fix it.
05 Why It Matters
Demonstrates evaluation culture — the practice that distinguishes production AI systems from demos. Shows understanding of reliability engineering for AI features: defining expected behavior, measuring it systematically, and iterating on failures. A strong interview signal for any AI engineering role.