Last year I published two papers in the European Journal of Applied Physiology. This week I built an AI that answers questions using them as its knowledge base. Here's what that process taught me about the gap between research and product — and why domain expertise is AI's most underrated advantage.
The gap that bothered me
When I finished my second paper — a study on predicting physiological decoupling in trained cyclists — I sent the PDF to a friend who coaches amateur triathletes. His response: “This is really interesting. What do I do with it?”
That question stuck with me. We'd run the study, collected the data, written up the findings. But between “conclusion” and “what does this mean for Tuesday's training session” was a gap that academic papers don't cross.
Most research doesn't cross it. Papers get cited by other papers. Findings trickle down through coaches, podcasts, and Reddit threads, losing nuance at every step. The athletes who would most benefit never read the journal.
So I built AthleteIQ — a sports science AI chat that answers training questions backed by the actual research, including my own papers, and explains it in terms that a non-researcher can act on.
The knowledge base problem
When most people hear “AI backed by research,” they immediately think RAG: chunk the PDFs, embed them, retrieve on query. It's the standard move. But for this project, I deliberately went the other way — a hand-crafted knowledge base encoded in a system prompt.
Here's why: RAG is a retrieval problem. You're betting that the chunk containing the right information will be close enough in embedding space to the user's query. For broad corpora that's fine. But for a narrow domain with specific formulas, precise findings, and conceptual relationships, approximate retrieval introduces exactly the wrong kind of errors. A sports scientist getting the Epley formula slightly wrong, or confusing VT1 with VT2, is worse than useless.
Instead, I encoded the key findings, formulas, methodologies, and implications directly:
// From lib/knowledge-base.ts
export const SYSTEM_PROMPT = `You are AthleteIQ — a sports science AI assistant...
### Harrison Dudley-Rode's Published Research
**Paper 1: Carbohydrate Ingestion and Ventilatory Threshold (2024)**
- Citation: Dudley-Rode et al. (2024). European Journal of Applied Physiology.
DOI: 10.1007/s00421-024-05687-w
- Key finding: Acute carbohydrate ingestion does NOT significantly alter
the ventilatory threshold (VT1) during incremental exercise in trained cyclists
- Methodology: Crossover RCT, n=12 trained cyclists, 75g CHO vs placebo
- Implication: Athletes should not expect carbohydrate supplementation to
shift their aerobic threshold acutely. Long-term metabolic training
remains the primary driver of VT1 adaptation.
`;

This is precise in a way that retrieved chunks rarely are. The model knows exactly what the paper found, what it didn't find, what the methodology was, and what a practitioner should conclude. No hallucination risk about the finding itself — it's hardcoded.
The tradeoff: it doesn't scale to 10,000 papers. But for a focused domain with 6-8 foundational concepts and 5-10 key studies, it's more reliable and faster to build.
Designing for scientific honesty
The hardest part wasn't the code — it was writing the system prompt in a way that would make the AI scientifically honest rather than confidently wrong.
AI models have a frustrating tendency to synthesize plausible-sounding answers when the evidence is actually mixed. Ask most LLMs “does sweet spot training work?” and you'll get a confident yes with citations. The real answer is: it's popular, the science supporting polarized training is stronger, and the literature is actively debated.
So I built epistemic honesty into the instructions:
- Say “the evidence is mixed” when it actually is
- Distinguish between “this is well-established” and “this is one study”
- Be honest about what a small n means for generalizability
- Never extrapolate findings beyond what the research actually supports
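For illustration, rules like these can live alongside the knowledge base as one more section of the system prompt. The sketch below is hypothetical — the constant names and wording are mine, not the exact text AthleteIQ ships with:

```typescript
// Hypothetical sketch: encoding epistemic-honesty rules as a prompt section.
// BASE_PROMPT stands in for the real knowledge-base prompt shown earlier.
const BASE_PROMPT = "You are AthleteIQ — a sports science AI assistant...";

const HONESTY_RULES = `
### Epistemic honesty
- When the literature is mixed, say so explicitly: "the evidence is mixed".
- Distinguish single-study findings ("preliminary") from well-established ones.
- Flag small samples (e.g. n=12) when discussing generalizability.
- Never extrapolate beyond what the cited research supports.
`;

// Concatenate so the rules travel with every request, not just some.
const SYSTEM_PROMPT = `${BASE_PROMPT}\n${HONESTY_RULES}`;
```

Putting the rules in the system prompt rather than per-message instructions means they apply uniformly, including to follow-up questions deep in a conversation.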
The goal was an AI that a researcher would trust, not just one that sounds authoritative.
The streaming architecture
The technical implementation is straightforward: Next.js App Router on the frontend, Anthropic's SDK on the backend, streamed via a ReadableStream over the API route.
// app/api/chat/route.ts
import Anthropic from "@anthropic-ai/sdk";
import { SYSTEM_PROMPT } from "@/lib/knowledge-base";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function POST(request: Request) {
  const { messages } = await request.json();
  const formattedMessages = messages.map(
    (m: { role: "user" | "assistant"; content: string }) => ({
      role: m.role,
      content: m.content,
    })
  );

  const stream = await client.messages.stream({
    model: "claude-haiku-4-5",
    max_tokens: 1200,
    system: SYSTEM_PROMPT,
    messages: formattedMessages,
  });

  const readable = new ReadableStream({
    async start(controller) {
      const encoder = new TextEncoder();
      for await (const chunk of stream) {
        if (
          chunk.type === "content_block_delta" &&
          chunk.delta.type === "text_delta"
        ) {
          controller.enqueue(encoder.encode(chunk.delta.text));
        }
      }
      controller.close();
    },
  });

  return new Response(readable, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}

I chose Claude Haiku 4.5 deliberately — it's fast enough that streaming feels snappy, and it handles scientific content with sufficient precision. For this use case, latency and cost matter more than raw capability. A response that appears in 2 seconds feels conversational; one that takes 10 doesn't, regardless of quality.
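On the client side, a plain-text stream like this can be consumed with the standard fetch reader API. A minimal sketch, assuming a `/api/chat` route shaped like the handler above — the `streamChat` and `onDelta` names are illustrative, not from the actual codebase:

```typescript
// Sketch of a client-side consumer for a plain-text streaming endpoint.
async function streamChat(
  messages: { role: "user" | "assistant"; content: string }[],
  onDelta: (text: string) => void // called with each chunk as it arrives
): Promise<string> {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });
  if (!res.ok || !res.body) {
    throw new Error(`chat request failed: ${res.status}`);
  }

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let full = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // { stream: true } keeps multi-byte characters intact across chunks
    const text = decoder.decode(value, { stream: true });
    full += text;
    onDelta(text); // e.g. append to the chat bubble in the UI
  }
  return full;
}
```

The `{ stream: true }` decode option matters for scientific text: symbols like “≥” or “μ” can be split across chunk boundaries, and without it they render as garbage.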
What the AI gets right that ChatGPT doesn't
I tested the same questions through ChatGPT and AthleteIQ side by side.
On “does carbohydrate intake affect my aerobic threshold?” — ChatGPT confidently explained that yes, carbohydrates support aerobic performance by improving glycolytic capacity and delaying lactate accumulation, eventually getting around to mentioning that training adaptations take time. It conflated performance effects with threshold location.
AthleteIQ answered: acute CHO ingestion doesn't shift VT1, citing the actual crossover RCT. It then explained the distinction between acute performance effects (carbs help you maintain power) and metabolic threshold location (VT1 is determined by your training history, not what you ate this morning). Cleaner, more precise, more actionable.
That's the advantage of building on specific research rather than a general knowledge base — the model can't confabulate because it's working from explicit, precise source material.
The domain expertise multiplier
Here's what this project confirmed for me: domain expertise and AI engineering are multiplicative, not additive. A general AI engineer building an exercise physiology chat will spend weeks learning enough to write an accurate system prompt. I spent an afternoon, because I already knew which concepts matter, which research is solid versus preliminary, and which questions coaches actually ask.
The flip side is also true: I could have written a blog post explaining my papers. But an interactive tool that a coach can query during a planning session — at any time, for any athlete — delivers value in a way that a static PDF never will. Engineering unlocks the research.
My sports science background felt like a liability when I started building software. Turns out it's an unfair advantage in an AI-native world. The models are commoditizing code generation. What they can't commoditize is the judgment to know which knowledge to encode, which findings to trust, and what the right answer actually is.
What's next
The current version is a chat interface backed by a curated knowledge base. The natural next step is personalization — the right advice on periodization for someone training for a marathon in 12 weeks differs significantly from the right advice for a track cyclist peaking for a national championship.
Further out: integrating with actual training data. The Durability Analyzer already parses Garmin FIT files. An AI that can see your actual power files and answer questions in that context — “given my FIT files from the last 6 weeks, how should I approach this week?” — is genuinely interesting.
That's the version that starts to look less like a demo and more like a tool I'd use myself.
Try it: athlete-iq-seven.vercel.app · Source: github.com/matua-agent/athlete-iq
Papers referenced: Dudley-Rode et al. (2024) · Dudley-Rode et al. (2025)