Andrej Karpathy called it context engineering — and the shift in language reflects a shift in what actually matters. Not tricks or magic phrases. The discipline of structuring what goes into the context window. I built a live demo showing 6 strategies streaming in parallel so you can see the difference for yourself.
Every AI conversation starts from zero — unless you build memory infrastructure. I built Agent Memory Demo to make the pattern tangible: two parallel API calls per message, a structured fact store by category, memory injected into context naturally. The pattern that makes AI assistants feel like they actually know you.
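The two-parallel-calls pattern the post describes can be sketched as follows. This is a minimal illustration, not the demo's actual code: `generateReply` and `extractFacts` are hypothetical stand-ins for the real Claude API calls, and the fact store is simplified to a flat string array.

```typescript
// Hypothetical stand-ins for the two real Claude API calls.
async function generateReply(message: string, memory: string[]): Promise<string> {
  return `Reply to "${message}" (knowing: ${memory.join("; ") || "nothing yet"})`;
}

async function extractFacts(message: string): Promise<string[]> {
  // A real implementation would ask the model for structured facts by category.
  return message.includes("vegetarian") ? ["diet: vegetarian"] : [];
}

// The pattern: per user message, fire the reply and the fact
// extraction in parallel, then fold new facts into the store
// so the next turn's reply has them in context.
async function handleMessage(message: string, store: string[]): Promise<string> {
  const [reply, facts] = await Promise.all([
    generateReply(message, store),
    extractFacts(message),
  ]);
  store.push(...facts);
  return reply;
}
```

Running extraction in parallel rather than after the reply keeps per-message latency at one round trip instead of two.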
A single model and a chat box is a demo. Production AI is a planner decomposing tasks, specialists executing each piece, a synthesizer assembling the output. I built a demo that makes the full pipeline visible in real time — five agents, streaming, coordinating. The architecture is the prompt.
Your browser ships with SpeechRecognition and SpeechSynthesis built in. I used them — plus Claude — to build a voice AI assistant with under-1-second response latency. The LLM integration took 30 minutes. The voice pipeline took 4 hours. Here's what the ratio reveals about building production voice AI.
Zero-shot, few-shot, chain-of-thought, and system-prompt tuning produce meaningfully different outputs — not just in style, but in accuracy and reliability. I built Prompt Lab to make these differences visible: four simultaneous SSE streams, same prompt, different techniques. Here's what I actually learned.
Standups are a translation problem: commit messages → accomplishment statements. That translation is high-friction, low-value work — exactly what language models are good at. I built ai-standup: one command reads git history across all my repos, sends it to Claude, and generates a professional standup in 4 seconds. Here's how it works and what it reveals about AI-native dev tooling.
Most AI teams spend 90% of their time on the model and 10% on evaluation. Production teams flip that. I built AI Eval Lab to make evaluation fast enough to actually happen — define test cases before writing prompts, run them all at once, score automatically, get AI-powered fix suggestions on failures.
Tags: ai, evaluation, prompt engineering, engineering, testing, production ai
Commit messages, PR descriptions, and code explanations are high-friction, low-value work. I built three CLI tools that use Claude to eliminate them: ai-commit (staged diff → 3 conventional commit options → you pick), smart-pr (branch diff → structured PR description), and ai-explain (pipe any code, get an explanation). Here's how they work and why raw fetch beats the SDK for CLI tooling.
Three Claude tools embedded in a job application Kanban board: streaming cover letter generator, job fit analyzer with visual scoring, and tailored interview prep. The AI has my full background in context — published papers, portfolio apps, role differentiators. It generates letters that are actually specific, not generic.
Most AI demos are stateless: one prompt, one response. Real agentic systems plan, decompose into steps, run in parallel, handle failures, and synthesize across multiple context windows. I built one with full visibility at every step — and the hard part wasn't the AI calls.
TTFT, tokens/second, total time — all visible, side by side, for the same prompt. I built a real-time streaming comparison tool because model selection shouldn't be based on benchmarks you can't feel. Here's what I actually learned about the Haiku/Sonnet tradeoff from watching them race.
Client updates from activity logs. Billing narratives from time entries. Calendar events from court documents. These aren't AI features — they're AI workflows. I built all three in one session to understand the pattern. The insight: the interesting engineering is in the output format, not the API call.
The Model Context Protocol is everywhere in enterprise AI discussions, but most demos are black boxes. I built one that makes the full protocol visible — initialize handshake, tool discovery, every tool call and result, token costs — streamed live to a trace panel as it happens.
There's a difference between understanding MCP and implementing it. I built a working Node.js MCP server using the official @modelcontextprotocol/sdk — stdio transport, 6 tools with Zod schemas, full JSON-RPC 2.0 protocol. Install it in Claude Desktop with one config line and ask Claude about my portfolio.
Tags: mcp, node.js, typescript, protocol, claude desktop, enterprise ai
RAG demos always use vector embeddings. I used BM25 — the same algorithm that powers Elasticsearch and Apache Lucene. No external embedding API, no cold starts, fully deterministic and interpretable. Here's what I learned about when lexical retrieval is the right choice.
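To make "deterministic and interpretable" concrete, here is a minimal BM25 scorer over an in-memory corpus. It is a sketch of the standard formula (with Lucene-style IDF smoothing), not the post's implementation; `k1` and `b` are set to their common defaults.

```typescript
// Standard BM25 free parameters, at their common default values.
const k1 = 1.2;
const b = 0.75;

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

// Score every document against the query. Higher is more relevant;
// a document sharing no query terms scores exactly zero.
function bm25Score(query: string, docs: string[]): number[] {
  const tokenized = docs.map(tokenize);
  const avgLen = tokenized.reduce((sum, d) => sum + d.length, 0) / docs.length;
  return tokenized.map((doc) => {
    let score = 0;
    for (const term of tokenize(query)) {
      const tf = doc.filter((t) => t === term).length; // term frequency
      const df = tokenized.filter((d) => d.includes(term)).length; // doc frequency
      // IDF with +0.5 smoothing, as used by Lucene.
      const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5));
      score += (idf * tf * (k1 + 1)) /
        (tf + k1 * (1 - b + (b * doc.length) / avgLen));
    }
    return score;
  });
}
```

Every term's contribution can be traced by hand, which is exactly the interpretability argument: no opaque embedding space, just term frequencies and document lengths.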
I built a chat interface that makes the agentic loop visible: the AI reasons about tools, calls them, gets results, and synthesizes a response — and you can see every step. Here's the implementation and why tool descriptions matter more than schemas.
I built an AI contract analyzer that extracts risk flags, obligations, and negotiation leverage from any legal document. The interesting part isn't the AI — it's the output design. Why 'here's the bad clause' is less useful than 'here's what to ask for instead.'
Tags: ai, legal tech, structured output, engineering, enterprise ai
I built a document Q&A tool in four hours. The code was simple — the interesting part was the system prompt design that prevents hallucination and requires exact citations. Why 'don't speculate' is the most important instruction in document intelligence.
I published two papers in the European Journal of Applied Physiology, then built an AI chat that uses them as its knowledge base. Why I chose a hand-crafted knowledge base over RAG, how to design for scientific honesty, and what domain expertise multiplies.
Enterprise AI runs on multi-step LLM pipelines. I built a visible one: 4 stages, 4 specialized prompts, context accumulating forward. Each stage shows timing, token usage, and intermediate output. Here's the architecture and what building it revealed.
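The "context accumulating forward" idea can be sketched in a few lines. This is an illustration of the control flow only: each `Stage` here is a hypothetical stand-in for a specialized LLM prompt, and the real pipeline additionally records timing and token usage per stage.

```typescript
// One stage = one specialized prompt. Each stage sees the original
// input plus every earlier stage's intermediate output.
type Stage = (context: string) => Promise<string>;

async function runPipeline(input: string, stages: Stage[]): Promise<string[]> {
  const outputs: string[] = [];
  let context = input;
  for (const stage of stages) {
    const out = await stage(context);
    outputs.push(out);         // keep the intermediate output visible
    context += "\n" + out;     // accumulate context forward
  }
  return outputs;
}
```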
The Model Context Protocol is USB for AI capabilities. Here's what I learned building six MCP servers in production — tool description design, access boundaries, and why structured error messages matter.
I spent years studying strength adaptation as a sports scientist before I wrote production code. Building PR detection for my workout tracker brought those worlds together — the Epley e1RM formula, per-set benchmarking, and what the data actually shows about nonlinear strength gains.
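The Epley estimated one-rep max mentioned above is a one-line formula: e1RM = weight × (1 + reps / 30). A minimal version, with the common single-rep special case (the raw formula would overestimate a true 1RM by about 3%):

```typescript
// Epley estimated one-rep max: e1RM = weight * (1 + reps / 30).
// Many implementations special-case reps === 1, since a single rep
// at a given weight is already a direct 1RM observation.
function epleyE1RM(weight: number, reps: number): number {
  return reps === 1 ? weight : weight * (1 + reps / 30);
}
```

For example, 100 kg for 10 reps estimates a one-rep max of about 133.3 kg.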
I wanted to understand the AI video clipping space — so I built the minimal version of the core loop. Transcript in, highlights out. Then I noticed where my minimal version broke. That's where the real engineering is.
Six weeks ago I gave an AI agent access to my calendar, email, GitHub, and production servers. It runs 24/7, builds apps while I sleep, and sends me a morning brief. Here's what actually happened.
A month-long sprint to build a portfolio from scratch. What I shipped, what I scrapped, and what the process taught me about the future of software development.
I published peer-reviewed research in exercise physiology before I wrote my first line of production code. Here's why that turned out to be an advantage.
If you're getting 'Could not resolve authentication method' errors from the Anthropic SDK on Vercel, it's a streaming compatibility bug — not your API key. The fix is one pattern change. Here's exactly what breaks, why, and how to replace the SDK with raw fetch.
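The core of the raw-fetch replacement is parsing the server-sent-event stream yourself. The helper below extracts text deltas from a chunk of SSE lines; the event names and `content_block_delta` shape follow Anthropic's documented streaming format, but treat this as a sketch, not the post's exact code (a production parser also buffers JSON split across chunk boundaries).

```typescript
// Pull text deltas out of a chunk of SSE lines from the Messages API.
function extractTextDeltas(sseChunk: string): string[] {
  const deltas: string[] = [];
  for (const line of sseChunk.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip event: and blank lines
    const payload = line.slice("data: ".length);
    try {
      const event = JSON.parse(payload);
      if (event.type === "content_block_delta" && event.delta?.type === "text_delta") {
        deltas.push(event.delta.text);
      }
    } catch {
      // Partial JSON across a chunk boundary; a real parser buffers it.
    }
  }
  return deltas;
}
```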
I built a code reviewer that returns structured JSON: category, severity, line number, suggestion. The output is always parseable and typed. Here's the system prompt design that makes structured output reliable, and why schema specificity is the key variable.
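On the consuming side, "always parseable and typed" still warrants a defensive parse before trusting model output. A minimal validator for the finding shape the post describes (field names here are assumptions matching the teaser, not the post's actual schema):

```typescript
// Assumed shape of one review finding, per the fields named above.
interface Finding {
  category: string;
  severity: "info" | "warning" | "error";
  line: number;
  suggestion: string;
}

// The model is instructed to return only JSON, but production code
// validates the shape anyway before handing findings to the UI.
function parseFindings(raw: string): Finding[] {
  const data = JSON.parse(raw);
  if (!Array.isArray(data)) throw new Error("expected a JSON array of findings");
  return data.map((f, i) => {
    if (
      typeof f.category !== "string" ||
      typeof f.line !== "number" ||
      typeof f.suggestion !== "string" ||
      !["info", "warning", "error"].includes(f.severity)
    ) {
      throw new Error(`finding ${i} has an unexpected shape`);
    }
    return f as Finding;
  });
}
```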
I wanted a personal finance app that didn't send my bank data to third-party servers. So I built one: CSV import with format inference, Claude-powered transaction categorization, and multi-scenario financial modeling — all client-side.
Recruiters can now ask my portfolio questions and get real-time streaming answers from Claude. Here's the technical breakdown: ReadableStream, Anthropic SDK streaming, system prompt design, and model selection.
The Durability Analyzer started as peer-reviewed science. Building the product from the paper revealed a gap between what researchers communicate and what athletes actually need.