Voice AI Demo
A voice AI assistant that proves browsers are more powerful than you think
<1s perceived latency
Browser-native STT + TTS
Full conversation context
Streaming sentence-by-sentence synthesis
01 The Problem
Voice AI in the browser is underexplored: most developers reach for Deepgram or Whisper immediately. I wanted to see what the native browser APIs actually give you, build the full voice pipeline end-to-end, and find where the real latency bottlenecks are (spoiler: not the LLM).
02 The Approach
Web Speech API handles both capture and playback entirely in the browser. Claude Haiku via streaming fetch handles generation. The key insight: start TTS synthesis as each sentence arrives (not after the full response) to achieve sub-1-second perceived latency. The LLM call took 30 minutes to build. The voice pipeline took 4 hours.
03 Architecture Decisions
Pipeline streaming for low latency
The response streams from Claude → sentence boundary detection → SpeechSynthesis queue. Users hear the first sentence while Claude generates the second. Perceived response time drops from 2-3 seconds (full response) to under 1 second (first sentence).
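The streaming stage can be sketched as a small buffer that emits complete sentences as soon as a boundary appears, so synthesis can start before the full response arrives. This is an illustrative sketch, not the demo's actual code; the function name, the punctuation heuristic, and the callback shape are all assumptions.

```javascript
// Sketch: buffer streamed text from the LLM and emit each complete
// sentence immediately, so TTS can start on sentence one while the
// model is still generating sentence two. Illustrative only.
function createSentenceSplitter(onSentence) {
  let buffer = "";
  return {
    // Call with each streamed text chunk from the model.
    push(chunk) {
      buffer += chunk;
      let match;
      // A sentence boundary here is ./!/? followed by whitespace.
      while ((match = buffer.match(/[^.!?]*[.!?]+\s/)) !== null) {
        onSentence(match[0].trim());
        buffer = buffer.slice(match[0].length);
      }
    },
    // Flush any trailing partial sentence when the stream ends.
    flush() {
      const rest = buffer.trim();
      if (rest) onSentence(rest);
      buffer = "";
    },
  };
}
```

In the browser, each emitted sentence would be wrapped in a `SpeechSynthesisUtterance` and handed to `speechSynthesis.speak()`, which queues utterances in order, so sentences play back-to-back as they arrive.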
System prompt tuned for speech
Standard LLM responses are optimized for reading — long paragraphs, bullet points, markdown formatting. The system prompt constrains responses to under 3 sentences in conversational language. This single constraint changes the character of responses from 'chatbot writing' to 'person talking'.
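The demo's actual prompt wording isn't shown here; this is a hypothetical sketch of the kind of constraint described, with every word my own assumption.

```javascript
// Illustrative system prompt in the spirit described above --
// not the demo's actual wording.
const SYSTEM_PROMPT = [
  "You are a voice assistant. Your replies are spoken aloud, not read.",
  "Answer in at most 3 short sentences.",
  "Use plain conversational language: no bullet points, no markdown,",
  "no headings, and no code blocks.",
].join(" ");
```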
Graceful browser compatibility handling
SpeechRecognition is Chromium-only (Chrome/Edge). The demo detects API availability on load and shows a clear compatibility warning on unsupported browsers rather than silently failing.
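Detection along these lines can be done at load time with a plain feature check; the function name and return shape below are assumptions, but the prefixed `webkitSpeechRecognition` lookup is how Chromium actually exposes the constructor.

```javascript
// Feature-detect the Web Speech API instead of failing silently.
// Chromium ships the recognizer under a webkit prefix; the global
// object is injectable here only so the check is testable outside
// a browser.
function getVoiceSupport(globalObj = typeof window !== "undefined" ? window : {}) {
  const Recognition =
    globalObj.SpeechRecognition || globalObj.webkitSpeechRecognition;
  return {
    stt: typeof Recognition === "function", // speech-to-text available?
    tts: typeof globalObj.speechSynthesis !== "undefined", // text-to-speech?
  };
}
```

On an unsupported browser the app can then render the compatibility warning up front rather than showing a mic button that never works.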
04 Key Insight
The bottleneck in voice AI isn't the LLM — it's the voice pipeline. Silence detection timing, sentence boundary detection for streaming TTS, connection recovery when recognition cuts out, and voice selection UX all took longer than the Claude integration. Building voice AI reveals where the real complexity lives.
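Silence detection, one of the fiddly pieces above, often reduces to a resettable timer: every interim transcript from the recognizer pushes the deadline back, and a quiet gap ends the turn. A minimal sketch, assuming a debounce-style timer (names and the window length are illustrative, not the demo's code):

```javascript
// Sketch of silence detection: reset a timer on every interim
// recognition result; if no new speech arrives within the window,
// treat the user's utterance as finished. Illustrative only.
function createSilenceTimer(onSilence, ms = 1200) {
  let timer = null;
  return {
    // Call on each recognition result (interim or final).
    poke() {
      clearTimeout(timer);
      timer = setTimeout(onSilence, ms);
    },
    // Call when the turn ends for another reason (e.g. stop button).
    cancel() {
      clearTimeout(timer);
    },
  };
}
```

In the browser this would be poked from the `onresult` handler of a `SpeechRecognition` instance, with `onSilence` stopping recognition and sending the transcript to the model. The hard part the section describes is tuning `ms`: too short clips slow speakers, too long feels laggy.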
05 Why It Matters
Demonstrates the full voice AI stack — the same problem space companies like Giga ($61M Series A, Vancouver) are solving at enterprise scale. Shows understanding of where the hard engineering problems are: latency, voice tuning, and browser compatibility — not just LLM integration.