
Why I Implemented BM25 From Scratch Instead of Using an Embedding API


ai · rag · search · engineering · information retrieval

RAG demos always use vector embeddings. I used BM25 — the same algorithm that powers Elasticsearch and Apache Lucene. No external embedding API, no cold starts, fully deterministic and interpretable. Here's why that was the right call, and what it taught me about retrieval tradeoffs.

The problem with embedding-dependent RAG demos

Every RAG tutorial follows the same pattern: use OpenAI's text-embedding-3-small (or ada-002 in older tutorials), store vectors in Pinecone or a local FAISS index, compute cosine similarity, retrieve top-k chunks, generate with GPT-4. Rinse, repeat.

That's a perfectly valid architecture. But for a portfolio demo, it has problems: it requires an OpenAI API key, cold starts involve downloading model weights or connecting to a vector store, and — most importantly — the retrieval step is still a black box. You see that chunks were retrieved; you don't see why.

I wanted to build something where the retrieval logic is visible, explainable, and zero-dependency. That led me to BM25.

What BM25 actually is

BM25 (Okapi BM25) is a probabilistic ranking function developed in the 1990s. It's the default retrieval algorithm in Elasticsearch, Apache Lucene, and Solr — the systems that power most of the world's production search. It's not old-fashioned; it's the industry baseline.

The core formula scores each document against a query based on:

Term frequency (TF): How many times does the query term appear in this chunk? More occurrences → higher score. But with saturation: going from 1 occurrence to 2 matters more than going from 10 to 11. This is controlled by the k₁ parameter (I used 1.5).

Inverse document frequency (IDF): How rare is this term across all chunks? Terms that appear in every chunk ("the", "a", "is") have near-zero IDF weight. Terms that appear in only one chunk are highly distinctive. Rare terms that match are strong signals.

Length normalization: A 2,000-word chunk will naturally accumulate more term hits than a 200-word chunk. BM25 normalizes by document length so longer chunks don't dominate just because they contain more text. Controlled by b (I used 0.75).

The result is a score per chunk, computed in pure TypeScript, with no external API call. Fast, deterministic, explainable.
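The three components combine into a single per-chunk score. Here's a self-contained sketch of what that looks like in TypeScript — tokenization is simplified and the function names are mine, not the demo's, but the k₁ = 1.5 and b = 0.75 values match the ones above:

```typescript
const K1 = 1.5; // term-frequency saturation
const B = 0.75; // length-normalization strength

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function bm25Scores(query: string, chunks: string[]): number[] {
  const docs = chunks.map(tokenize);
  const n = docs.length;
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / n;

  // Term frequencies per chunk.
  const tfs = docs.map((doc) => {
    const tf = new Map<string, number>();
    for (const t of doc) tf.set(t, (tf.get(t) ?? 0) + 1);
    return tf;
  });

  // Document frequency per term, then smoothed (Lucene-style) IDF.
  const df = new Map<string, number>();
  for (const tf of tfs) for (const t of tf.keys()) df.set(t, (df.get(t) ?? 0) + 1);
  const idf = (t: string) => {
    const d = df.get(t) ?? 0;
    return Math.log(1 + (n - d + 0.5) / (d + 0.5));
  };

  const qTerms = tokenize(query);
  return docs.map((doc, i) => {
    let score = 0;
    for (const t of qTerms) {
      const f = tfs[i].get(t) ?? 0;
      if (f === 0) continue;
      // TF saturation via K1; length normalization via B.
      score += (idf(t) * (f * (K1 + 1))) /
        (f + K1 * (1 - B + (B * doc.length) / avgLen));
    }
    return score;
  });
}
```

Note the saturation behavior: because f appears in both numerator and denominator, the contribution of a term asymptotically approaches idf(t) × (K1 + 1) no matter how often it repeats.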

Implementing it from scratch

I wrote the BM25 index builder and scorer from scratch rather than using a library. The implementation builds an in-memory index per request: for each chunk, count term frequencies; across all chunks, compute inverse document frequencies; at query time, score every chunk in O(q × n) where q is query terms and n is chunk count.

For a demo with 2–5 documents and ~20 chunks, this is instantaneous. For a production system with millions of documents, you'd pre-build a persistent inverted index (which is what Elasticsearch does). The API design is the same; the data structure scales differently.

The matched terms are also surfaced in the UI — every chunk shows which query terms it matched, rendered as small tags. This is what makes retrieval interpretable: you can look at a high-scoring chunk and immediately understand why it ranked first.
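The matched-term extraction behind those tags is a one-liner on top of the same tokenizer — a sketch, with illustrative names:

```typescript
// For each chunk, collect the query terms that actually appear in it.
// These are what the UI renders as small tags next to each result.
function matchedTerms(query: string, chunk: string): string[] {
  const tokenize = (s: string) =>
    new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);
  const chunkTerms = tokenize(chunk);
  return [...tokenize(query)].filter((t) => chunkTerms.has(t));
}
```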

The chunking strategy

Chunking is where most RAG implementations cut corners. The naive approach — splitting every N characters — often breaks mid-sentence, severs context, and produces chunks that are incoherent on their own.

I implemented paragraph-aware chunking: split on double newlines first (preserving semantic units), then fall back to character-level splitting only when individual paragraphs exceed the chunk size limit. Chunks overlap by ~80 characters to prevent context loss at boundaries.

The overlap is critical. Without it, a fact that spans two chunks — a number introduced at the end of one chunk and contextualized at the start of the next — is split across retrieval units. The answer model may retrieve one without the other and produce a confused response. Overlap is cheap; missing context is expensive.
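A minimal sketch of that chunker — the maxLen default is illustrative, the ~80-character overlap matches the number above:

```typescript
function chunkText(text: string, maxLen = 800, overlap = 80): string[] {
  // Split on blank lines first, preserving paragraphs as semantic units.
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current = "";

  for (const p of paragraphs) {
    if (p.length > maxLen) {
      // Fall back to character-level splitting for oversized paragraphs,
      // stepping by (maxLen - overlap) so consecutive slices share `overlap` chars.
      if (current) chunks.push(current);
      current = "";
      for (let i = 0; i < p.length; i += maxLen - overlap) {
        chunks.push(p.slice(i, i + maxLen));
      }
      continue;
    }
    if (current && current.length + p.length + 2 > maxLen) {
      chunks.push(current);
      // Seed the next chunk with the tail of this one to preserve boundary context.
      current = current.slice(-overlap);
    }
    current = current ? current + "\n\n" + p : p;
  }
  if (current) chunks.push(current);
  return chunks;
}
```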

Where BM25 falls short

BM25 is vocabulary-dependent. If the query uses different words than the document — “heart rate decoupling” vs “aerobic drift” — BM25 scores zero even if the concepts are identical. Neural embeddings handle this: they encode semantic meaning rather than surface tokens, so synonyms and paraphrases still match.

For production systems where query vocabulary is unpredictable — user-facing search, enterprise knowledge bases, multilingual corpora — you need neural retrieval or at least a synonym expansion layer on top of BM25. Elasticsearch supports both: BM25 as the default, vector search (kNN) as an optional layer, and reciprocal rank fusion to combine them.

The hybrid approach — BM25 for initial recall, neural reranker for precision — is often better than either alone. BM25 is fast and high-recall. The reranker (a cross-encoder like ms-marco-MiniLM-L-6-v2) is accurate but slow. Run BM25 to get 50 candidates, rerank to get the top 5. This is the production architecture at companies like Cohere, Weaviate, and Elastic.
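The rank-fusion half of that pipeline is simple enough to sketch. This is generic reciprocal rank fusion over any number of ranked lists, not the demo's code; k = 60 is the constant commonly used in the RRF literature and in Elasticsearch's implementation:

```typescript
// Fuse ranked lists of document IDs: each list contributes 1 / (k + rank)
// per document, so items ranked highly in multiple lists rise to the top.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

In the hybrid setup, the two input lists would be the BM25 ranking and the vector-search ranking; the fused top-50 then goes to the cross-encoder for reranking.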

Why BM25 was right for this demo

The point of the RAG Demo isn't to be a production retrieval system. It's to make the RAG pattern visible and understandable. BM25 achieves this better than neural embeddings for three reasons:

Interpretability. You can see exactly why a chunk scored 4.2 vs 1.8 — which terms matched, how often, how rare they are. Neural embedding similarity is a dot product across hundreds or thousands of dimensions. No human reads that.

Zero external dependencies. No API key, no warm-up period, no rate limits. The demo works the moment you load it.

Demonstrates the tradeoff. Showing BM25 and explaining its limitation (vocabulary dependency) teaches more about RAG than showing a black-box embedding call. The demo UI links to what you'd do differently in production.

The generation layer

With the top-5 chunks retrieved, the generation step is straightforward: send them to Claude Haiku with a citation-enforcement system prompt. The constraint matters: the model is instructed to use only the retrieved context, cite sources inline with [Source N] notation, and explicitly say when information isn't present in the sources.
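A sketch of what that system prompt assembly looks like — the exact wording in the demo may differ, but the three constraints are the ones listed above, and the sources are numbered so [Source N] citations resolve:

```typescript
// Build a citation-enforcing system prompt from the retrieved chunks.
// Wording is illustrative, not the demo's exact prompt.
function buildSystemPrompt(chunks: string[]): string {
  const context = chunks
    .map((chunk, i) => `[Source ${i + 1}]\n${chunk}`)
    .join("\n\n");
  return [
    "Answer using ONLY the context below.",
    "Cite every claim inline with [Source N] notation.",
    "If the context does not contain the answer, say so explicitly.",
    "",
    context,
  ].join("\n");
}
```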

This isn't just cosmetic. It's a hallucination prevention contract. A model that's allowed to combine retrieved context with parametric knowledge is a model that can invent plausible-sounding facts that happen to fit the context. Citation enforcement forces the model to show its work — which chunk the answer comes from, and therefore whether the answer is traceable to source material.

For legal tech, medical information, and enterprise knowledge bases, source traceability isn't a nice-to-have. It's the product.