
I Built a Visible Multi-Agent System. Here's What Production AI Coordination Actually Looks Like.

A single model and a chat box is a demo. Production AI is a planner decomposing tasks, specialists executing each piece, a synthesizer assembling the output. I built a demo that makes the full pipeline visible in real time — five agents, streaming, coordinating. The architecture is the prompt.

ai · agents · engineering · orchestration · streaming · enterprise ai · architecture

Most AI demos are a single model and a chat box. Production AI is multiple agents with different roles, passing context between them. I built a demo that makes the coordination visible.

The gap between demos and reality

There's a large gap between what most AI demos show and what production AI systems actually look like. A demo is typically: user types prompt → model responds → done. A production system is: user gives a goal → planner decomposes it → specialists execute each piece → synthesizer assembles the output → user sees a polished result.

That decomposition matters. Different subtasks need different expertise. A research task needs different prompting than a writing task. A critique task needs a different stance than a generation task. One generalist model handling all of this in a single context tends to produce averaged, mediocre output across all dimensions.

The multi-agent pattern — give each agent a narrow, well-defined role — produces consistently better output than a single agent with a complex prompt. I wanted to build a demo that shows this visibly.

The architecture

Multi-Agent Demo runs a five-agent pipeline on any task you give it:

  1. Planner — receives the task, decomposes it into three distinct subtasks with assigned roles
  2. Researcher — tackles the research/evidence subtask
  3. Writer — tackles the synthesis/prose subtask
  4. Critic — challenges assumptions and identifies gaps
  5. Synthesizer — takes all three specialist outputs and produces a polished final result

All agents run on Claude Haiku. The output quality difference isn't about the model — it's about specialisation.
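The roles and the data flowing between them can be sketched as types. This is illustrative, not the demo's actual code — the names are mine:

```typescript
// The five agent roles in the pipeline, in execution order.
type AgentName = "planner" | "researcher" | "writer" | "critic" | "synthesizer";

// One subtask from the Planner's decomposition.
interface Subtask {
  id: number;
  role: "Researcher" | "Writer" | "Critic";
  task: string;
}

// The envelope each streaming event carries as an agent runs.
type AgentEvent =
  | { agent: AgentName; type: "start" }
  | { agent: AgentName; type: "token"; text: string }
  | { agent: AgentName; type: "done"; subtasks?: Subtask[] };

// Terminal events carry the agent's structured result, if any.
function isDone(e: AgentEvent): e is Extract<AgentEvent, { type: "done" }> {
  return e.type === "done";
}
```
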

The streaming implementation

The backend orchestrates the pipeline sequentially: Planner → Specialists (conceptually parallel) → Synthesizer. Each step streams back to the frontend via Server-Sent Events. The UI shows each agent's output appearing in real time — you watch the conversation unfold.

The SSE protocol I use is a simple JSON envelope on each event:

// Event types sent from the API route
{ "agent": "planner", "type": "start" }
{ "agent": "planner", "type": "token", "text": "Breaking this into..." }
{ "agent": "planner", "type": "done", "subtasks": [...] }
{ "agent": "researcher", "type": "start" }
// ... and so on
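On the backend, each envelope becomes one SSE frame: a `data:` line followed by a blank line. A minimal sketch of the emitter, assuming a Web Streams route handler — `sseLine` and `toSSEStream` are my own names:

```typescript
// Serialize one JSON envelope as a Server-Sent Events frame.
// SSE frames are "data: <payload>\n\n"; clients split on the blank line.
function sseLine(event: object): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// Wrap the pipeline's event sequence in a streamable response body.
// The caller would pass an async generator that yields envelopes
// as each agent starts, streams tokens, and finishes.
function toSSEStream(events: AsyncIterable<object>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream({
    async start(controller) {
      for await (const event of events) {
        controller.enqueue(encoder.encode(sseLine(event)));
      }
      controller.close();
    },
  });
}
```
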

The frontend maintains a state object for each agent and updates it as events arrive. No WebSockets, no external state management — just fetch with ReadableStream and React useState.
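The client side is small enough to show in full. A sketch — `makeSSEParser` and `streamTask` are invented names, and the `/api/run` endpoint is an assumption; a real component would call `setState` where `onEvent` is invoked:

```typescript
// Accumulate streamed text, split it into SSE frames, and hand each
// decoded JSON envelope to a callback. Chunks can end mid-frame, so we
// buffer the incomplete tail until the next chunk arrives.
function makeSSEParser(onEvent: (e: { agent: string; type: string }) => void) {
  let buffer = "";
  return (chunk: string) => {
    buffer += chunk;
    const frames = buffer.split("\n\n");
    buffer = frames.pop() ?? ""; // keep any incomplete frame
    for (const frame of frames) {
      if (frame.startsWith("data: ")) {
        onEvent(JSON.parse(frame.slice(6)));
      }
    }
  };
}

// Usage sketch: pipe the fetch body through the parser.
async function streamTask(task: string, onEvent: (e: any) => void) {
  const res = await fetch("/api/run", {
    method: "POST",
    body: JSON.stringify({ task }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  const parse = makeSSEParser(onEvent);
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    parse(decoder.decode(value, { stream: true }));
  }
}
```
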

The Planner prompt

The Planner is the most important agent in the pipeline. Its output determines the quality of everything downstream. A bad decomposition produces bad specialist outputs no matter how good the specialists are.

The system prompt is tight:

You are a planning agent. Your job is to decompose a task into exactly 3 
subtasks for specialist agents. The subtasks must be:
1. Distinct — no overlap in what they cover
2. Complete — together they cover the full task
3. Concrete — each is actionable by a specialist in one pass

Return ONLY valid JSON:
{
  "subtasks": [
    { "id": 1, "role": "Researcher", "task": "..." },
    { "id": 2, "role": "Writer", "task": "..." },
    { "id": 3, "role": "Critic", "task": "..." }
  ]
}

Forcing JSON output from the Planner gives me structured data I can use to construct each specialist's context. No parsing heuristics, no regex — the subtask is just subtasks[0].task.
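Extraction then reduces to `JSON.parse` plus a contract check. A sketch — `parseSubtasks` is my own name, and the code-fence stripping is a defensive assumption, since models occasionally wrap JSON in markdown fences:

```typescript
interface Subtask {
  id: number;
  role: string;
  task: string;
}

// Parse the Planner's output and enforce its contract:
// valid JSON with exactly three subtasks.
function parseSubtasks(raw: string): Subtask[] {
  const cleaned = raw
    .trim()
    .replace(/^```(?:json)?\s*/, "") // strip a leading markdown fence, if any
    .replace(/\s*```$/, "");
  const parsed = JSON.parse(cleaned);
  if (!Array.isArray(parsed.subtasks) || parsed.subtasks.length !== 3) {
    throw new Error("Planner must return exactly 3 subtasks");
  }
  return parsed.subtasks;
}
```
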

Specialist role injection

Each specialist gets a system prompt that defines their role precisely. The Critic's prompt is illustrative:

You are a critical analyst. Your job is to challenge the task, identify 
what's missing, what assumptions are made, and what could go wrong.

Don't just summarise — push back. Find the weaknesses. Be specific.
2-3 focused paragraphs. No hedging.

The "no hedging" instruction matters. Without it, the Critic tends to produce "on the other hand" diplomatic output that doesn't actually critique anything. Telling the model its job is to push back, and then removing the escape hatch, produces sharper output.

The Synthesizer challenge

The Synthesizer is the hardest agent to prompt well. It receives 500-800 tokens from three different specialists and needs to produce a coherent final output without just concatenating them.

The prompt installs a clear hierarchy: use the Researcher's facts, use the Writer's structure and prose quality, incorporate the Critic's strongest objections. The instruction I've found most effective:

Don't summarise the three inputs. Use them as raw material.
The final output should read as if one expert wrote it,
not as if three experts were stitched together.
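Assembling the Synthesizer's context is then just careful string building: label each specialist's output so the hierarchy instruction has clear referents. A sketch with invented names:

```typescript
interface SpecialistOutput {
  role: string; // "Researcher" | "Writer" | "Critic"
  text: string;
}

// Build the Synthesizer's user message. Each specialist's output gets
// a labeled section so the system prompt can refer to them by role.
function buildSynthesizerInput(task: string, outputs: SpecialistOutput[]): string {
  const sections = outputs.map((o) => `## ${o.role}\n${o.text}`).join("\n\n");
  return `Original task: ${task}\n\nSpecialist outputs:\n\n${sections}`;
}
```
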

What the demo teaches

Running a handful of tasks through the demo makes several things obvious:

Role clarity beats prompt length. A 50-word specialist prompt with a clear role outperforms a 200-word general prompt. The model performs better when it knows exactly what it is and what it's not.

Decomposition is the skill. The biggest variable in multi-agent quality is how well you split the task. Overlapping subtasks produce contradictory outputs. Incomplete subtasks produce synthesized outputs with holes. The Planner is where to invest.

Context threads forward, not backward. The Synthesizer knows everything; the Researcher knows nothing about what the Writer will produce. This is fine — but it means you need to think carefully about sequencing. A Researcher who doesn't know the Critic's angle might do redundant work.

The architecture is the prompt. How you structure the pipeline — what each agent knows, what it produces, what it receives — determines output quality more than any individual prompt.

Why this matters for enterprise AI

Enterprise AI teams are building multi-agent systems because the alternative — one massive context window handling everything — doesn't scale. Context costs money. Specialisation improves quality. Parallelism reduces latency.

The pattern in production systems: an orchestrator handles routing and state; specialist agents have narrow, deep prompts; a synthesizer or formatter handles final output. Add observability — which is exactly what this demo surfaces — and you have the skeleton of most enterprise AI pipelines.

The demo is deliberately small to make the architecture legible. Production systems have more agents, more complex routing, failure handling, and retry logic. But the core pattern — decompose, specialise, synthesize — is the same.