Most AI research tools are a black box — you ask, it answers, you have no idea what happened in between. I wanted to see the work. So I built a research agent that externalizes its entire process: question decomposition, per-sub-query research, cross-analysis, and synthesis — all streaming live in front of you.
Most people practice salary negotiation by reading articles. The problem: reading doesn't prepare you for hearing 'that's at the top of our range.' You need reps. I built an AI HR manager that actually pushes back — with a coach mode that teaches you what you're doing right and wrong after each round.
Six applications ready to send, zero practice done. So I built an AI interview simulator: pick a target company, get grilled by an AI interviewer (streamed live), get scored after each answer, finish with a full report card. The hardest part wasn't the streaming — it was calibrating the score so feedback was actually useful.
Andrej Karpathy called it context engineering — and the shift in language reflects a shift in what actually matters. Not tricks or magic phrases. The discipline of structuring what goes into the context window. I built a live demo showing 6 strategies streaming in parallel so you can see the difference for yourself.
Every AI conversation starts from zero — unless you build memory infrastructure. I built Agent Memory Demo to make the pattern tangible: two parallel API calls per message, a structured fact store by category, memory injected into context naturally. The pattern that makes AI assistants feel like they actually know you.
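The fact-store half of that pattern can be sketched in a few lines. A minimal sketch, assuming illustrative categories and a merge-by-key upsert (not the demo's actual schema):

```typescript
// Categorized fact store: one bucket per category, newest value wins per key.
// Category names below are assumptions for illustration.
type Category = "identity" | "preferences" | "projects" | "context";

interface Fact {
  key: string;
  value: string;
  updatedAt: number;
}

class FactStore {
  private facts = new Map<Category, Map<string, Fact>>();

  // Upsert: a repeated key in the same category overwrites the old value.
  upsert(category: Category, key: string, value: string): void {
    const bucket = this.facts.get(category) ?? new Map<string, Fact>();
    bucket.set(key, { key, value, updatedAt: Date.now() });
    this.facts.set(category, bucket);
  }

  // Render stored facts as a block ready to inject into the system prompt.
  toContext(): string {
    const lines: string[] = [];
    for (const [category, bucket] of this.facts) {
      for (const f of bucket.values()) {
        lines.push(`[${category}] ${f.key}: ${f.value}`);
      }
    }
    return lines.join("\n");
  }
}
```

The second parallel call per message would extract new facts from the user's turn and `upsert` them, while `toContext()` feeds the next request's system prompt.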
A single model and a chat box is a demo. Production AI is a planner decomposing tasks, specialists executing each piece, a synthesizer assembling the output. I built a demo that makes the full pipeline visible in real time — five agents, streaming, coordinating. The architecture is the prompt.
Your browser ships with SpeechRecognition and SpeechSynthesis built in. I used them — plus Claude — to build a voice AI assistant with under-1-second response latency. The LLM integration took 30 minutes. The voice pipeline took 4 hours. Here's what the ratio reveals about building production voice AI.
Zero-shot, few-shot, chain-of-thought, and system-prompt tuning produce meaningfully different outputs — not just in style, but in accuracy and reliability. I built Prompt Lab to make these differences visible: four simultaneous SSE streams, same prompt, different techniques. Here's what I actually learned.
Standups are a translation problem: commit messages → accomplishment statements. That translation is high-friction, low-value work — exactly what language models are good at. I built ai-standup: one command reads git history across all my repos, sends it to Claude, and generates a professional standup in 4 seconds. Here's how it works and what it reveals about AI-native dev tooling.
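The translation step starts with cleaning the raw log before it reaches the model. A minimal sketch, assuming `git log --since=yesterday --oneline` output with short-hash prefixes:

```typescript
// Turn `git log --oneline` output into the bullet list sent to Claude.
// Assumes the default "abbrevhash subject" line format.
function commitsToBullets(onelineLog: string): string[] {
  return onelineLog
    .split("\n")
    .map((l) => l.trim())
    .filter((l) => l.length > 0)
    .map((l) => l.replace(/^[0-9a-f]{7,40}\s+/, ""))    // drop the hash prefix
    .filter((subject) => !subject.startsWith("Merge "))  // skip merge commits
    .map((subject) => `- ${subject}`);
}
```

From there the bullets for every repo get concatenated into one prompt asking for accomplishment statements.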
Most AI teams spend 90% of their time on the model and 10% on evaluation. Production teams flip that. I built AI Eval Lab to make evaluation fast enough to actually happen — define test cases before writing prompts, run them all at once, score automatically, get AI-powered fix suggestions on failures.
Tags: ai, evaluation, prompt engineering, engineering, testing, production ai
Commit messages, PR descriptions, and code explanations are high-friction, low-value work. I built three CLI tools that use Claude to eliminate them: ai-commit (staged diff → 3 conventional commit options → you pick), smart-pr (branch diff → structured PR description), and ai-explain (pipe any code, get an explanation). Here's how they work and why raw fetch beats the SDK for CLI tooling.
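The raw-fetch approach the tools share can be sketched as a single request builder. The endpoint, headers, and body shape follow Anthropic's Messages API; the model name and `max_tokens` value here are placeholder assumptions:

```typescript
// Build the fetch arguments for one Anthropic Messages API call.
// No SDK: just the documented endpoint, three headers, and a JSON body.
function buildMessagesRequest(apiKey: string, prompt: string) {
  return {
    url: "https://api.anthropic.com/v1/messages",
    init: {
      method: "POST",
      headers: {
        "x-api-key": apiKey,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
      },
      body: JSON.stringify({
        model: "claude-3-5-haiku-latest", // assumption: pick any current model
        max_tokens: 1024,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Usage: const { url, init } = buildMessagesRequest(key, stagedDiff);
//        const res = await fetch(url, init);
```

For a CLI, this is the whole dependency surface: no SDK version drift, and the binary stays small.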
Three Claude tools embedded in a job application Kanban board: streaming cover letter generator, job fit analyzer with visual scoring, and tailored interview prep. The AI has my full background in context — published papers, portfolio apps, role differentiators. It generates letters that are actually specific, not generic.
Most AI demos are stateless: one prompt, one response. Real agentic systems plan, decompose into steps, run in parallel, handle failures, and synthesize across multiple context windows. I built one with full visibility at every step — and the hard part wasn't the AI calls.
TTFT, tokens/second, total time — all visible, side by side, for the same prompt. I built a real-time streaming comparison tool because model selection shouldn't be based on benchmarks you can't feel. Here's what I actually learned about the Haiku/Sonnet tradeoff from watching them race.
Client updates from activity logs. Billing narratives from time entries. Calendar events from court documents. These aren't AI features — they're AI workflows. I built all three in one session to understand the pattern. The insight: the interesting engineering is in the output format, not the API call.
The Model Context Protocol is everywhere in enterprise AI discussions, but most demos are black boxes. I built one that makes the full protocol visible — initialize handshake, tool discovery, every tool call and result, token costs — streamed live to a trace panel as it happens.
There's a difference between understanding MCP and implementing it. I built a working Node.js MCP server using the official @modelcontextprotocol/sdk — stdio transport, 6 tools with Zod schemas, full JSON-RPC 2.0 protocol. Install it in Claude Desktop with one config line and ask Claude about my portfolio.
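Under the SDK, the wire traffic is plain JSON-RPC 2.0 over stdio. A sketch of the three core frames; the method names and protocol version follow the MCP spec, while the tool name is a hypothetical example:

```typescript
// The client opens with an initialize request...
const initialize = {
  jsonrpc: "2.0",
  id: 1,
  method: "initialize",
  params: {
    protocolVersion: "2024-11-05",
    capabilities: {},
    clientInfo: { name: "demo-client", version: "0.1.0" },
  },
};

// ...then discovers what the server exposes...
const listTools = { jsonrpc: "2.0", id: 2, method: "tools/list" };

// ...then invokes a tool by name (tool name here is a made-up example).
const callTool = {
  jsonrpc: "2.0",
  id: 3,
  method: "tools/call",
  params: { name: "get_portfolio_projects", arguments: {} },
};

// Over stdio transport, each frame is one newline-delimited JSON line.
const frame = (msg: object) => JSON.stringify(msg) + "\n";
```

The SDK hides all of this behind `server.tool(...)` registrations, but seeing the frames makes the "one config line in Claude Desktop" part much less magical.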
Tags: mcp, node.js, typescript, protocol, claude desktop, enterprise ai
RAG demos always use vector embeddings. I used BM25 — the same algorithm that powers Elasticsearch and Apache Lucene. No external embedding API, no cold starts, fully deterministic and interpretable. Here's what I learned about when lexical retrieval is the right choice.
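The whole scorer fits in one function, which is part of the appeal. A minimal sketch of Okapi BM25, assuming whitespace-tokenized docs and the conventional defaults k1 = 1.2, b = 0.75:

```typescript
type Doc = string[]; // pre-tokenized document

// Okapi BM25: idf-weighted term frequency with saturation and length normalization.
function bm25Score(
  query: string[],
  doc: Doc,
  docs: Doc[],
  k1 = 1.2,
  b = 0.75,
): number {
  const N = docs.length;
  const avgdl = docs.reduce((sum, d) => sum + d.length, 0) / N;
  let score = 0;
  for (const term of query) {
    const df = docs.filter((d) => d.includes(term)).length;
    if (df === 0) continue; // term appears nowhere in the corpus
    const idf = Math.log((N - df + 0.5) / (df + 0.5) + 1);
    const tf = doc.filter((t) => t === term).length;
    // tf saturates via k1; b penalizes longer-than-average documents.
    score +=
      (idf * tf * (k1 + 1)) /
      (tf + k1 * (1 - b + (b * doc.length) / avgdl));
  }
  return score;
}
```

Every number in that formula is inspectable, which is exactly the interpretability argument: you can explain any ranking by printing the per-term contributions.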
I built a chat interface that makes the agentic loop visible: the AI reasons about tools, calls them, gets results, and synthesizes a response — and you can see every step. Here's the implementation and why tool descriptions matter more than schemas.
I built an AI contract analyzer that extracts risk flags, obligations, and negotiation leverage from any legal document. The interesting part isn't the AI — it's the output design. Why 'here's the bad clause' is less useful than 'here's what to ask for instead.'
Tags: ai, legal tech, structured output, engineering, enterprise ai
I built a document Q&A tool in four hours. The code was simple — the interesting part was the system prompt design that prevents hallucination and requires exact citations. Why 'don't speculate' is the most important instruction in document intelligence.
I published two papers in the European Journal of Applied Physiology, then built an AI chat that uses them as its knowledge base. Why I chose a hand-crafted knowledge base over RAG, how to design for scientific honesty, and what domain expertise multiplies.
Enterprise AI runs on multi-step LLM pipelines. I built a visible one: 4 stages, 4 specialized prompts, context accumulating forward. Each stage shows timing, token usage, and intermediate output. Here's the architecture and what building it revealed.
The Model Context Protocol is USB for AI capabilities. Here's what I learned building six MCP servers in production — tool description design, access boundaries, and why structured error messages matter.
I spent years studying strength adaptation as a sports scientist before I wrote production code. Building PR detection for my workout tracker brought those worlds together — the Epley e1RM formula, per-set benchmarking, and what the data actually shows about nonlinear strength gains.
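The core of the PR detection is small. The Epley formula itself is standard; the return-weight-for-a-single convention and the comparison against a best prior e1RM are implementation choices sketched here, not necessarily the tracker's exact code:

```typescript
// Epley estimated one-rep max: e1RM = weight * (1 + reps / 30).
// Convention assumption: a true single is returned as-is.
function epleyE1RM(weight: number, reps: number): number {
  if (reps <= 1) return weight;
  return weight * (1 + reps / 30);
}

// A set is a PR candidate when its e1RM beats the best prior e1RM for that lift.
function isPR(weight: number, reps: number, bestPriorE1RM: number): boolean {
  return epleyE1RM(weight, reps) > bestPriorE1RM;
}
```

Benchmarking every set this way is what surfaces rep-range PRs that a raw max-weight comparison would miss, e.g. 102.5 kg for 8 beating a 125 kg single.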
I wanted to understand the AI video clipping space — so I built the minimal version of the core loop. Transcript in, highlights out. Then I noticed where my minimal version broke. That's where the real engineering is.
Six weeks ago I gave an AI agent access to my calendar, email, GitHub, and production servers. It runs 24/7, builds apps while I sleep, and sends me a morning brief. Here's what actually happened.
A month-long sprint to build a portfolio from scratch. What I shipped, what I scrapped, and what the process taught me about the future of software development.
I published peer-reviewed research in exercise physiology before I wrote my first line of production code. Here's why that turned out to be an advantage.
Paste a job description and get: company research brief, role analysis with red flags, tailored talking points, 5 interview prep questions, salary intel, and an application checklist — all streaming live. The most useful section is the one that looks for reasons NOT to apply.
Pick a behavioral category, get a real question, answer in Practice or Timed Mode, then receive a scored rubric: Situation, Task, Action, Result each out of 2.5 — plus strengths, what to improve, and the ideal answer structure. Here's the hardest part: calibrating the scoring anchors.
Paste your legacy code, describe the migration target (Redux → Zustand, class → hooks, Pages Router → App Router), and watch the migrated code stream in alongside a line-by-line diff and explanation. The interesting part wasn't the Claude prompt — it was figuring out what information actually makes a migration useful.
Visualize any GitHub user's contribution patterns with an interactive heatmap. Enter a username to see their full contribution calendar, streak stats, language breakdown, and AI-generated insights about their coding habits and patterns. Built in one overnight session as part of a rapid AI-native development sprint.
Practice system design interviews with an AI Staff Engineer. Pick a challenge (Design YouTube, Design Uber, etc.), chat through your architecture, and watch a live Mermaid.js diagram build in real time. Timer, follow-up questions, and a scored debrief at the end. Built in one overnight session as part of a rapid AI-native development sprint.
When you're applying to Anthropic, the questions aren't generic — they're about evals, context engineering, tool use. I built a tool that takes any job description and generates a targeted flashcard deck: explicit skills, implied concepts, system design, and company-specific behavioral questions. With CSS 3D flip and spaced repetition.
GitHub profiles are noisy. I built a tool that synthesizes repos, languages, and activity into a clear engineer card — with a recruiter perspective and actionable quick wins. Here's how it works.
Paste a GitHub PR URL, get a streamed AI code review — file by file, risk scored, with a security check. The interesting part wasn't the GitHub API integration. It was learning that prompt structure and UI rendering strategy need to be designed together when you're streaming structured markdown.
If you're getting 'Could not resolve authentication method' errors from the Anthropic SDK on Vercel, it's a streaming compatibility bug — not your API key. The fix is one pattern change. Here's exactly what breaks, why, and how to replace the SDK with raw fetch.
I built a code reviewer that returns structured JSON: category, severity, line number, suggestion. The output is always parseable and typed. Here's the system prompt design that makes structured output reliable, and why schema specificity is the key variable.
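Schema specificity only pays off if the output is also guarded at runtime. A sketch of that guard; the field names mirror the post, while the three severity levels are assumptions:

```typescript
type Severity = "info" | "warning" | "error";

interface Finding {
  category: string;
  severity: Severity;
  line: number;
  suggestion: string;
}

// Parse the model's raw text and reject anything that drifts from the schema,
// so downstream code only ever sees well-typed findings.
function parseFindings(raw: string): Finding[] {
  const data = JSON.parse(raw);
  if (!Array.isArray(data)) throw new Error("expected a JSON array of findings");
  return data.map((f, i) => {
    if (
      typeof f.category !== "string" ||
      typeof f.suggestion !== "string" ||
      typeof f.line !== "number" ||
      !["info", "warning", "error"].includes(f.severity)
    ) {
      throw new Error(`finding ${i} failed schema check`);
    }
    return f as Finding;
  });
}
```

Failing loudly here is the point: a schema violation becomes a retry signal instead of a silently broken UI.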
I wanted a personal finance app that didn't send my bank data to third-party servers. So I built one: CSV import with format inference, Claude-powered transaction categorization, and multi-scenario financial modeling — all client-side.
Recruiters can now ask my portfolio questions and get real-time streaming answers from Claude. Here's the technical breakdown: ReadableStream, Anthropic SDK streaming, system prompt design, and model selection.
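The piece that turns the raw stream into visible tokens is a small parser over the server-sent events. The `content_block_delta` / `text_delta` shape follows Anthropic's streaming Messages API; everything else here is a minimal sketch:

```typescript
// Extract the text fragment from one SSE line of an Anthropic streaming
// response; returns null for pings, other event types, and non-data lines.
function extractTextDelta(sseLine: string): string | null {
  if (!sseLine.startsWith("data: ")) return null;
  const payload = sseLine.slice("data: ".length);
  try {
    const event = JSON.parse(payload);
    if (event.type === "content_block_delta" && event.delta?.type === "text_delta") {
      return event.delta.text as string;
    }
  } catch {
    // ignore non-JSON keep-alive lines
  }
  return null;
}
```

Each non-null fragment gets enqueued onto the outgoing `ReadableStream`, which is what makes the answer appear word by word in the browser.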
The Durability Analyzer started as peer-reviewed science. Building the product from the paper revealed a gap between what researchers communicate and what athletes actually need.