
I Built a Voice AI Assistant Using Only Browser APIs (And It's More Interesting Than You'd Think)


ai · voice · web speech api · engineering · streaming · browser · claude

Your browser ships with a speech recognition engine and a text-to-speech synthesizer, both accessible in JavaScript with zero dependencies. I used them — plus Claude — to build a voice AI assistant with under-1-second response latency. The LLM call took 30 minutes. The voice pipeline took 4 hours.

The technical stack

Voice AI Demo has three layers:

  • Capture: SpeechRecognition for real-time speech-to-text in the browser. No Whisper, no Deepgram, no external API.
  • Generation: Claude Haiku, accessed via raw fetch streaming from a Next.js API route.
  • Synthesis: SpeechSynthesis to read Claude's response aloud, starting as soon as the first sentence is complete — not waiting for the full response.

The voice processing is entirely client-side. Only the text goes to my API route (and then to Anthropic). No audio leaves the browser.
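The generation layer is a plain `fetch` with the response body read as a stream. A minimal sketch of the client side, assuming a `/api/chat` route that streams plain text back (the route name, payload shape, and function names here are my assumptions, not necessarily the demo's exact code):

```typescript
// Send the final transcript to the API route and surface streamed
// text chunks to a callback as they arrive. Text only; no audio.
async function askAssistant(
  transcript: string,
  onChunk: (text: string) => void
): Promise<void> {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message: transcript }),
  });
  if (!res.ok || !res.body) {
    throw new Error(`Chat request failed: ${res.status}`);
  }

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    onChunk(decoder.decode(value, { stream: true }));
  }
}
```

Reading the body incrementally, rather than awaiting `res.text()`, is what makes the sentence-at-a-time synthesis below possible.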

The interesting engineering problem: latency

Building this, I assumed the bottleneck would be the LLM. It isn't — Claude Haiku responds in under 200ms. The bottleneck is the speech pipeline:

  1. SpeechRecognition silence detection: The browser waits for a pause (typically 1–2 seconds of silence) before finalizing speech. Too sensitive and it cuts you off; too lenient and it feels sluggish.
  2. First-token latency: Even fast models add 100–200ms before the first token appears. At voice interaction speeds, this is perceptible.
  3. Speech synthesis queue: If you try to speak the full response at once, users wait. The fix: stream Claude's response, detect sentence boundaries (., ?, !), and start speaking each sentence as it arrives.
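Step 3 is the one worth showing in code. A minimal sketch of sentence-boundary buffering over a token stream (the class and callback names are mine): incoming chunks accumulate in a buffer, and a sentence is handed to TTS as soon as a terminator plus whitespace appears.

```typescript
// Buffer streamed chunks and flush complete sentences to a callback,
// so speech can start before the full response has arrived.
class SentenceStreamer {
  private buffer = "";

  constructor(private onSentence: (sentence: string) => void) {}

  push(chunk: string): void {
    this.buffer += chunk;
    // Split on sentence-ending punctuation followed by whitespace.
    const parts = this.buffer.split(/(?<=[.!?])\s+/);
    // The last part may be an incomplete sentence; keep it buffered.
    this.buffer = parts.pop() ?? "";
    for (const sentence of parts) this.onSentence(sentence);
  }

  // Emit whatever remains once the stream ends.
  flush(): void {
    const rest = this.buffer.trim();
    if (rest) this.onSentence(rest);
    this.buffer = "";
  }
}
```

Wiring `onSentence` to the speech queue means the first sentence starts playing while Claude is still generating the rest.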

The result: from the moment you stop talking to the moment the assistant starts responding is under 1 second. That's the threshold that makes voice feel natural.

What the Web Speech API actually gives you

The SpeechRecognition API is more capable than most developers realize:

const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
recognition.continuous = false;      // Stop after one utterance
recognition.interimResults = true;   // Show live transcript as you speak
recognition.lang = 'en-US';

With interimResults: true, you can display a live transcript while the user is still speaking — exactly like Google Voice Search. The interim results are rough, but the final result is surprisingly accurate.
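A sketch of an `onresult` handler that uses both result types. The event typing here is a minimal local stand-in for the real `SpeechRecognitionEvent`, and the two callbacks are hypothetical app functions:

```typescript
// Minimal structural typing; the real SpeechRecognitionEvent is richer.
type RecognitionResult = { isFinal: boolean; 0: { transcript: string } };
type RecognitionEvent = { resultIndex: number; results: RecognitionResult[] };

function handleResult(
  event: RecognitionEvent,
  showLiveTranscript: (t: string) => void, // e.g. grey text under the mic
  sendToAssistant: (t: string) => void     // finalize and call the LLM
): void {
  let interim = "";
  let finalText = "";
  // resultIndex points at the first result that changed in this event.
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    if (result.isFinal) finalText += result[0].transcript;
    else interim += result[0].transcript;
  }
  if (interim) showLiveTranscript(interim); // rough, updates while speaking
  if (finalText) sendToAssistant(finalText); // accurate, worth sending
}
```

In the browser you would assign this via `recognition.onresult`, passing your own UI callbacks.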

SpeechSynthesis is simpler but has one critical gotcha: Chrome silently cuts off long utterances partway through, so you can't hand it a whole response at once. The pattern is to queue sentences:

function speakSentences(text: string) {
  // Split on sentence-ending punctuation followed by whitespace
  const sentences = text.split(/(?<=[.!?])\s+/);
  sentences.forEach(sentence => {
    const utterance = new SpeechSynthesisUtterance(sentence);
    utterance.voice = selectedVoice;  // chosen elsewhere via getVoices()
    utterance.rate = 1.1;             // slightly faster than default
    // speak() appends to an internal queue, so sentences play in order
    window.speechSynthesis.speak(utterance);
  });
}
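One more gotcha worth a sketch: `selectedVoice` above has to come from `getVoices()`, which can return an empty array on first call in Chrome because the voice list loads asynchronously and arrives with the `voiceschanged` event. A hedged helper, written against a minimal structural type so it stays testable outside a browser (in the real app the argument is `window.speechSynthesis`):

```typescript
// Minimal structural typing for the pieces of SpeechSynthesis we use.
type Voice = { name: string; lang: string; localService: boolean };
type Synth = {
  getVoices(): Voice[];
  addEventListener(type: string, cb: () => void, opts?: { once: boolean }): void;
};

// Resolve the voice list, waiting for 'voiceschanged' if it isn't
// populated yet. Chrome fills the list asynchronously.
function loadVoices(synth: Synth): Promise<Voice[]> {
  return new Promise((resolve) => {
    const voices = synth.getVoices();
    if (voices.length > 0) return resolve(voices);
    synth.addEventListener("voiceschanged", () => resolve(synth.getVoices()), {
      once: true,
    });
  });
}

// In the browser: const voices = await loadVoices(window.speechSynthesis);
// Then prefer a local English voice if one is installed:
//   const selectedVoice =
//     voices.find((v) => v.lang.startsWith("en") && v.localService) ?? voices[0];
```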

System prompt design for voice

Claude's responses needed to be appropriate for being spoken, not read. The problem with most AI responses is they're optimized for text — long paragraphs, bullet points, markdown formatting. None of that translates to speech.

My system prompt:

You are a helpful voice assistant. Give clear, concise responses 
suitable for being spoken aloud. Keep responses under 3 sentences 
unless the user asks for detail. Don't use markdown formatting. 
Don't use bullet points. Write as you would speak.

The explicit "3 sentences unless asked for detail" constraint changed the character of responses entirely. The assistant sounds like a person talking, not a chatbot writing.
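For reference, a sketch of what that Next.js route could look like. The file location (an app-router `route.ts`), the model string, and the SSE-to-plain-text transform are my assumptions, not necessarily the demo's exact code; the transform also handles split SSE lines only crudely:

```typescript
const SYSTEM_PROMPT =
  "You are a helpful voice assistant. Give clear, concise responses " +
  "suitable for being spoken aloud. Keep responses under 3 sentences " +
  "unless the user asks for detail. Don't use markdown formatting. " +
  "Don't use bullet points. Write as you would speak.";

export async function POST(req: Request): Promise<Response> {
  const { message } = await req.json();

  const upstream = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY ?? "",
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-3-haiku-20240307",
      max_tokens: 300,
      stream: true,
      system: SYSTEM_PROMPT,
      messages: [{ role: "user", content: message }],
    }),
  });

  // Re-emit only the text deltas from the SSE stream as plain text,
  // so the client can decode the body without parsing SSE itself.
  const decoder = new TextDecoder();
  const encoder = new TextEncoder();
  const textOnly = new TransformStream<Uint8Array, Uint8Array>({
    transform(chunk, controller) {
      for (const line of decoder.decode(chunk, { stream: true }).split("\n")) {
        if (!line.startsWith("data: ")) continue;
        try {
          const event = JSON.parse(line.slice(6));
          if (event.type === "content_block_delta" && event.delta?.text) {
            controller.enqueue(encoder.encode(event.delta.text));
          }
        } catch {
          // Ignore JSON split across chunk boundaries in this sketch.
        }
      }
    },
  });

  return new Response(upstream.body!.pipeThrough(textOnly));
}
```

Keeping the system prompt server-side also keeps the API key and the prompt out of the client bundle.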

The browser compatibility situation

SpeechRecognition only works in Chromium browsers (Chrome, Edge, Arc). Firefox doesn't support it. Safari has partial support but it's less reliable.
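In practice this means feature-detecting before showing the mic button. A small sketch (the fallback behavior is up to you):

```typescript
// True when the page can use built-in speech recognition.
// Chrome and Edge expose the API under the webkit prefix.
function speechRecognitionSupported(): boolean {
  const w = (globalThis as any).window;
  return !!w && ("SpeechRecognition" in w || "webkitSpeechRecognition" in w);
}

// Usage: hide the mic and fall back to a text input otherwise.
// if (!speechRecognitionSupported()) showTextInputFallback();
```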

For a production voice AI product, you'd use a proper STT service — Deepgram for low-latency streaming, Google Speech-to-Text for accuracy, or Whisper for privacy. The browser's SpeechRecognition is fine for demos but isn't enterprise-grade.

SpeechSynthesis is better supported but the voices are limited to whatever's installed on the user's OS. On Mac/Chrome you get ~80 voices including decent neural voices; on Linux you might get one robotic voice. For production, you'd use ElevenLabs or AWS Polly.

What production voice AI actually requires

Building this taught me exactly why enterprise voice AI at scale is genuinely hard:

  • Silence detection in noisy environments: In a call center there's constant background noise. A silence threshold that works in a quiet office fails when keyboards are clicking; production systems rely on trained voice-activity-detection models rather than a fixed timeout.
  • Latency at telephony quality: WebRTC over the internet adds 50–200ms. PSTN telephony adds more, and codec compression degrades audio. Getting sub-3-second total round-trip time at telephony quality is genuinely hard.
  • Language diversity: My demo works in whatever language SpeechRecognition supports. 99-language support at enterprise quality is a massive engineering challenge.

The ratio that matters

The LLM integration took 30 minutes. The voice pipeline — silence detection timing, sentence boundary detection for streaming TTS, voice selection UX, connection recovery — took 4 hours.

That ratio is the point. Voice AI isn't just LLM integration. The interesting engineering is in the edges: getting the audio pipeline to behave reliably across browsers, handle interruptions gracefully, and feel natural to users who don't know or care about the technical complexity behind the mic button.

For production voice AI: use Deepgram for STT (best latency), Claude for generation (best instruction-following at voice-appropriate length), and ElevenLabs for TTS (best quality). This demo uses browser APIs for the audio layers, which is fine for learning but not production-grade.

Try it here — Chrome or Edge required. Say anything. The assistant will answer in under a second.