
Agent Memory Demo

Persistent AI memory made visible — extract, store, inject, in real time

Built in one overnight session (March 3, 2026)
Next.js · Anthropic Claude · AI Memory · SSE Streaming · TypeScript · AI Architecture · Agent Design

Parallel extract + chat API calls

4-category memory taxonomy

localStorage persistence

Real-time memory graph

01. The Problem

LLMs are stateless. Every API call starts fresh. The gap between 'AI that forgets you immediately' and 'AI that feels like it knows you' is entirely in the memory infrastructure around the model — not the model itself. I wanted to make that infrastructure tangible: not a diagram explaining the pattern, but a running system you can interact with and watch.

02. The Approach

Two API calls fire per user message: the main chat response (Claude with full memory injected into the system prompt) and a lightweight extraction call (Claude Haiku: 'what new facts did we learn from this message?'). New facts appear in the right panel in real time. Memory persists via localStorage — close the tab, come back, Claude greets you with everything it learned.
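The two-calls-per-message flow can be sketched as a single handler that fires both requests at once. The function and parameter names below are hypothetical; the callers are injected as stubs so the control flow is visible without network access (in the real app they would be fetches to /api/chat and /api/extract):

```typescript
type Fact = { category: string; text: string };

// Sketch of the per-message flow: chat and extraction run in parallel,
// and the updated memory is what gets persisted and injected next turn.
async function handleUserMessage(
  message: string,
  memory: Fact[],
  callChat: (msg: string, mem: Fact[]) => Promise<string>,
  callExtract: (msg: string, mem: Fact[]) => Promise<Fact[]>
): Promise<{ reply: string; memory: Fact[] }> {
  // Fire both calls at once: the chat response streams to the user while
  // the lightweight extraction call mines the same message for new facts.
  const [reply, newFacts] = await Promise.all([
    callChat(message, memory),
    callExtract(message, memory),
  ]);
  return { reply, memory: [...memory, ...newFacts] };
}
```

Because `Promise.all` starts both promises before awaiting either, the extraction call adds no latency to the chat response unless it is the slower of the two.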

03. Architecture Decisions

Parallel extract + chat calls

Two API routes: /api/chat (streaming, full Claude Haiku response with memory in system prompt) and /api/extract (non-streaming, lightweight fact extraction). Both fire simultaneously on each user message. Extraction completes in <200ms — rarely adds perceptible latency.
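The extraction route is little more than a focused prompt over the current memory and the latest message. A minimal sketch of how that prompt might be assembled (the exact wording is an assumption, not the project's actual prompt):

```typescript
type Fact = { category: string; text: string };

// Hypothetical reconstruction of the /api/extract prompt assembly.
function buildExtractionPrompt(memory: Fact[], message: string): string {
  const known = memory.map((f) => `- [${f.category}] ${f.text}`).join("\n");
  return [
    "Extract durable facts about the user from their latest message.",
    "Classify each fact as: About You, Preferences, Goals, or Context.",
    "Only return genuinely new facts not already in memory.",
    "Known facts:",
    known || "(none yet)",
    `Latest message: ${JSON.stringify(message)}`,
    'Respond with a JSON array of {"category": string, "text": string}.',
  ].join("\n");
}
```

Including the known facts in the prompt is what lets the model judge novelty at all; without them, "new" has no reference point.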

Memory injection via system prompt

The current memory store is serialized as structured text and prepended to the chat system prompt. Claude uses the memory naturally — weaving known facts into responses without announcing it. The instruction: 'You know the following about the user. Use this context naturally without announcing it.'
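A sketch of that injection step, assuming a simple grouped-text serialization (the grouping format and base prompt are assumptions; the quoted instruction is from the write-up):

```typescript
type Fact = { category: string; text: string };

const BASE_PROMPT = "You are a helpful assistant."; // hypothetical base prompt

// Serialize the memory store as structured text and prepend it,
// with the "use naturally" instruction, to the chat system prompt.
function buildChatSystemPrompt(memory: Fact[]): string {
  if (memory.length === 0) return BASE_PROMPT;
  const byCategory = new Map<string, string[]>();
  for (const f of memory) {
    const list = byCategory.get(f.category) ?? [];
    list.push(f.text);
    byCategory.set(f.category, list);
  }
  const sections = [...byCategory.entries()]
    .map(([cat, texts]) => `${cat}:\n${texts.map((t) => `- ${t}`).join("\n")}`)
    .join("\n\n");
  return (
    "You know the following about the user. " +
    "Use this context naturally without announcing it.\n\n" +
    sections +
    "\n\n" +
    BASE_PROMPT
  );
}
```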

Fact categorization at extraction time

The extraction prompt asks Claude to classify each fact into one of four categories: About You, Preferences, Goals, Context. This taxonomy separates durable identity facts from transient situational ones — in a production system, different retention and retrieval policies would apply to each.
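The taxonomy maps naturally onto a TypeScript union, and because the categories come back from an untrusted model response, each parsed fact can be validated before entering the store. A sketch under that assumption (the guard and parser names are hypothetical):

```typescript
// The four-category taxonomy as a union type derived from a constant list.
const CATEGORIES = ["About You", "Preferences", "Goals", "Context"] as const;
type Category = (typeof CATEGORIES)[number];
type Fact = { category: Category; text: string };

// Model output is untrusted JSON, so each item is checked before storage.
function isFact(value: unknown): value is Fact {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.text === "string" &&
    v.text.length > 0 &&
    (CATEGORIES as readonly string[]).includes(v.category as string)
  );
}

function parseExtractedFacts(raw: string): Fact[] {
  try {
    const parsed: unknown = JSON.parse(raw);
    return Array.isArray(parsed) ? parsed.filter(isFact) : [];
  } catch {
    return []; // malformed model output: drop it rather than crash the UI
  }
}
```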

04. Key Insight

The memory extraction prompt's most important constraint is 'only return genuinely new facts not already in memory.' Without it, the model re-extracts the same facts across multiple turns, creating redundant noise. Deduplication is a memory engineering problem, not a model problem.
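The prompt constraint does the heavy lifting, but a cheap client-side check can back it up against exact and near-exact repeats. A normalization-based guard along these lines (an assumption, not the project's code):

```typescript
type Fact = { category: string; text: string };

// Normalize aggressively so "Lives in Berlin." and "lives in  berlin"
// collide on the same key.
function factKey(f: Fact): string {
  const norm = f.text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
  return `${f.category}::${norm}`;
}

// Return only incoming facts whose key is not already in the store.
function dedupe(existing: Fact[], incoming: Fact[]): Fact[] {
  const seen = new Set(existing.map(factKey));
  const fresh: Fact[] = [];
  for (const f of incoming) {
    const key = factKey(f);
    if (!seen.has(key)) {
      seen.add(key);
      fresh.push(f);
    }
  }
  return fresh;
}
```

This only catches surface-level duplicates; semantic repeats ("works at Acme" vs. "employed by Acme") still depend on the prompt constraint, which is why the constraint matters more than any post-filter.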

05. Why It Matters

Demonstrates the foundational pattern behind AI personalization features at Notion, Intercom, Linear, and similar products. Directly relevant to Cohere (RAG/memory pipelines), Giga (voice agents need session continuity), and any enterprise AI product where user context matters.