
RAG Demo

A RAG pipeline where you can see why each chunk was retrieved

Built in one overnight session
Next.js · BM25 · Anthropic Claude · TypeScript · RAG

BM25 from scratch

Paragraph-aware chunking

Score-ranked retrieval

Hallucination resistance via citation enforcement

01 · The Problem

RAG is the dominant architecture for production document AI — but most demos treat retrieval as a black box. You enter a query, magic happens, an answer appears. I wanted to build a demo where the retrieval step is fully transparent: chunk scores, matched terms, and the grounding constraint are all visible.

02 · The Approach

Documents are split into paragraph-aware chunks (~400 chars with overlap). At query time, BM25 (Okapi BM25, the same algorithm powering Elasticsearch) scores every chunk against the query. The top-5 chunks are sent to Claude Haiku with a citation-enforced system prompt: it may only use the retrieved context and must cite sources inline. The UI shows BM25 scores as bar charts, matched query terms as tags, and streams the answer.
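The retrieve-then-generate step above can be sketched as: score every chunk, sort, and keep the top 5. This is an illustrative sketch, not the project's code; `topK` and its parameters are hypothetical names, and the scoring function is a placeholder for the BM25 scorer described below.

```typescript
type Scored = { text: string; score: number };

// Score all chunks with the supplied scorer, then keep the k best.
function topK(
  chunks: string[],
  score: (chunk: string) => number,
  k = 5
): Scored[] {
  return chunks
    .map((text) => ({ text, score: score(text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

The top-k list, together with each chunk's score, is what both the generation prompt and the score bar chart consume.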

03 · Architecture Decisions

Okapi BM25 from scratch in TypeScript

Implemented BM25 (k₁=1.5, b=0.75) in pure TypeScript — no external retrieval library. The index is built in-memory per request, scoring each chunk by IDF-weighted term frequency with document-length normalization. Same algorithm as Elasticsearch and Apache Lucene. Zero external API dependencies for retrieval.
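A minimal sketch of Okapi BM25 scoring with the stated parameters (k₁=1.5, b=0.75), assuming simple lowercase tokenization; the function names are illustrative, not the project's actual implementation:

```typescript
type Chunk = { id: number; text: string };

const K1 = 1.5;
const B = 0.75;

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

// Score every chunk against the query with Okapi BM25:
// IDF-weighted term frequency, normalized by document length.
function bm25Scores(query: string, chunks: Chunk[]): number[] {
  const docs = chunks.map((c) => tokenize(c.text));
  const N = docs.length;
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / N;

  // Document frequency: how many chunks contain each term.
  const df = new Map<string, number>();
  for (const doc of docs) {
    for (const term of new Set(doc)) {
      df.set(term, (df.get(term) ?? 0) + 1);
    }
  }

  const queryTerms = tokenize(query);
  return docs.map((doc) => {
    const tf = new Map<string, number>();
    for (const t of doc) tf.set(t, (tf.get(t) ?? 0) + 1);

    let score = 0;
    for (const term of queryTerms) {
      const n = df.get(term) ?? 0;
      if (n === 0) continue; // term absent from the corpus
      const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5));
      const f = tf.get(term) ?? 0;
      score +=
        (idf * f * (K1 + 1)) /
        (f + K1 * (1 - B + (B * doc.length) / avgLen));
    }
    return score;
  });
}
```

Because the index is rebuilt in-memory per request, there is no persistence layer to manage — a reasonable tradeoff for a demo-sized corpus.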

Paragraph-aware chunking with fallback

The chunker splits on double newlines first (preserving semantic units), then falls back to character-level splitting for paragraphs that exceed the chunk size. Overlap of ~80 chars prevents context loss at chunk boundaries — a critical detail that affects answer quality.
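A sketch of that two-stage strategy, using the ~400-char chunk size and ~80-char overlap described above; `chunkDocument` is a hypothetical name and the exact boundary handling in the project may differ:

```typescript
const CHUNK_SIZE = 400;
const OVERLAP = 80;

// Split on blank lines first (preserving paragraphs as semantic
// units), falling back to an overlapping character window for any
// paragraph that exceeds the chunk size.
function chunkDocument(text: string): string[] {
  const chunks: string[] = [];
  for (const para of text.split(/\n\s*\n/)) {
    const p = para.trim();
    if (!p) continue;
    if (p.length <= CHUNK_SIZE) {
      chunks.push(p);
      continue;
    }
    // Fallback: sliding window; each chunk repeats the tail of the
    // previous one so sentences spanning a boundary survive intact.
    for (let start = 0; start < p.length; start += CHUNK_SIZE - OVERLAP) {
      chunks.push(p.slice(start, start + CHUNK_SIZE));
      if (start + CHUNK_SIZE >= p.length) break;
    }
  }
  return chunks;
}
```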

Citation-enforced generation

Claude Haiku's system prompt constrains it to use only retrieved chunks, with mandatory [Source N] inline citations. This isn't just cosmetic — it's a hallucination prevention contract. If the answer isn't in the retrieved context, Claude must say so. This is a key design decision for production document AI.
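One way to assemble such a contract is to number each retrieved chunk as a source block and state the constraint in the system prompt. This is a hedged sketch — the wording and `buildPrompt` helper are illustrative, not the project's actual prompt:

```typescript
type Retrieved = { text: string; score: number };

// Build a system prompt that forbids outside knowledge, plus a
// context string with numbered [Source N] blocks for citation.
function buildPrompt(chunks: Retrieved[]): { system: string; context: string } {
  const context = chunks
    .map((c, i) => `[Source ${i + 1}]\n${c.text}`)
    .join("\n\n");
  const system = [
    "Answer using ONLY the sources provided.",
    "Cite every claim inline as [Source N].",
    "If the answer is not in the sources, say you cannot answer from the provided context.",
  ].join(" ");
  return { system, context };
}
```

The `system` string would go in the Anthropic Messages API `system` field and the `context` string would be prepended to the user turn, so the model sees exactly the chunks the UI displays.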

Full pipeline transparency in the UI

Every step of the pipeline is visible: chunk count, retrieval scores, matched query terms, and streaming generation. The user can see exactly which chunks were retrieved and why — BM25 scores rendered as proportional bars. This makes the demo educational as well as functional.
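The proportional bars can be driven by normalizing each chunk's score against the top score, so the best match fills its bar. A minimal sketch with an assumed `barWidths` helper:

```typescript
// Map raw BM25 scores to percentage widths relative to the top score.
// EPSILON guards against division by zero when all scores are 0.
function barWidths(scores: number[]): number[] {
  const max = Math.max(...scores, Number.EPSILON);
  return scores.map((s) => Math.round((s / max) * 100));
}
```

Relative (rather than absolute) widths matter here because raw BM25 scores are unbounded and corpus-dependent; what the reader needs to see is how the chunks compare.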

04 · Key Insight

BM25 is vocabulary-dependent: it requires query terms to appear in the retrieved chunks. This is the core tradeoff vs. neural embeddings, which handle semantic similarity without lexical overlap. For a demo, BM25 is ideal — it's fast, deterministic, and the scoring is interpretable. Production systems typically layer a neural reranker on top of BM25 to get the best of both worlds.

05 · Why It Matters

Demonstrates the full RAG stack — retrieval strategy, chunking, grounded generation — with all tradeoffs made explicit. Directly relevant to enterprise document AI, legal tech, and any system where source attribution matters. The transparent scoring makes it a teaching tool as well as a demo.