
Getting Claude to Return Consistent JSON Every Time

I built a code reviewer that returns structured JSON: category, severity, line number, suggestion. The output is always parseable and typed. Here's the system prompt design that makes structured output reliable, and why schema specificity is the key variable.

ai · prompt engineering · structured output · engineering · developer tools


The problem with unstructured AI output

The naive code review prompt is: “Review this code and tell me what's wrong.” You get back a paragraph. Maybe it's accurate; maybe it's not. You can't sort it by severity, filter by category, or pipe it to another system. It's a wall of text that you read and act on manually.

For a developer tool, that's not good enough. I wanted something you can triage: see at a glance that there are two critical security issues and three minor style suggestions, click into the security issues first, generate fixes on demand. That requires structured output — JSON with a consistent schema.

Why structured output is hard

Getting a language model to reliably return valid, schema-consistent JSON is harder than it looks. The failure modes are:

Markdown fencing: The model wraps the JSON in a code block: ```json\n{...}\n```. Your JSON.parse throws.

Explanatory preamble: “Here is the JSON review of the code:” followed by the JSON. Parse error on the preamble.

Schema drift: The model decides to add extra fields, rename existing ones, or use different enum values than you specified. Your TypeScript interface breaks.

Truncation: For a large codebase, the model starts generating issues and runs out of tokens mid-JSON. Malformed output.

None of these are impossible to handle, but they all require defensive parsing logic. The goal is to eliminate them through prompt design rather than catch them in post-processing.

The system prompt that works

Here's the system prompt for the AI Code Reviewer:

You are an expert code reviewer... Return ONLY valid JSON in this exact format (no markdown, no explanation): [schema follows]

The key phrases are:

“Return ONLY valid JSON” — not “return JSON” or “format your response as JSON.” The word ONLY (capitalized) suppresses preamble and explanation. This alone eliminates the most common failure mode.

“(no markdown, no explanation)” — explicit prohibition of the two most common wrapping patterns. Belt and suspenders.

“in this exact format” — signals to the model that deviation from the specified schema is an error, not a creative choice.
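Putting the three phrases together, the assembled prompt might look like the sketch below. The field names and enum values are illustrative reconstructions, not the reviewer's exact prompt (which is elided above):

```typescript
// Sketch of the assembled system prompt. The schema body is an illustrative
// reconstruction; only the quoted key phrases come from the actual prompt.
const SYSTEM_PROMPT = `You are an expert code reviewer.
Return ONLY valid JSON in this exact format (no markdown, no explanation):
{
  "score": <integer 0-100>,
  "issues": [
    {
      "category": "<one of: bugs|security|performance|logic|style|docs>",
      "severity": "<one of: critical|major|minor|info>",
      "line": <integer line number>,
      "suggestion": "<string: concrete fix for this issue>"
    }
  ]
}`;
```

Note that the schema itself lives inside the prompt as a literal template, enum constraints and all, rather than as a separate description of what the JSON should contain.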

Schema specificity

Vague schemas produce inconsistent results. Specific schemas produce reliable ones.

Compare these two approaches to specifying severity:

Vague: "severity": "the severity of the issue"

Specific: "severity": "<one of: critical|major|minor|info>"

The vague version produces values like “high”, “medium”, “low”, “HIGH”, “Medium severity”, “important” — all different, all requiring normalization. The specific version with enum constraints produces consistent values every time.

The same applies to category enums, line number format, and the structure of the issues array. The more specific the schema, the less room the model has to improvise.
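The enum constraints in the prompt can be mirrored on the client as literal union types, so schema drift fails loudly at the boundary. A minimal sketch (type and guard names are my own, not the reviewer's actual interface):

```typescript
// Literal unions mirror the enum constraints written into the prompt schema.
type Severity = "critical" | "major" | "minor" | "info";
type Category = "bugs" | "security" | "performance" | "logic" | "style" | "docs";

interface Issue {
  category: Category;
  severity: Severity;
  line: number;       // integer line number in the reviewed file
  suggestion: string; // concrete fix suggestion
}

interface ReviewResult {
  score: number; // 0-100
  issues: Issue[];
}

// Runtime guard: catches drifted values like "high" or "Medium severity"
// before they reach typed code.
function isSeverity(v: unknown): v is Severity {
  return v === "critical" || v === "major" || v === "minor" || v === "info";
}
```

The guard is the runtime half of the contract: TypeScript types vanish at runtime, so a drifted enum value only gets caught if something actually checks it.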

Defensive parsing as a fallback

Even with a tight system prompt, models occasionally fence their JSON output. I added one fallback: a regex that strips a markdown code block if present.

// Handle potential markdown fencing
let jsonStr = rawText;
const fenced = rawText.match(/```(?:json)?\s*([\s\S]*?)```/);
if (fenced) jsonStr = fenced[1].trim();

const review = JSON.parse(jsonStr) as ReviewResult;

This is the minimal defensive layer. If the prompt is working, this regex never matches. But it prevents a hard failure on the edge cases.
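Wrapped as a function, the fallback handles both the well-behaved path and the fenced edge case identically. A sketch (the `ReviewResult` shape here is simplified for illustration):

```typescript
// Simplified result shape for the sketch; the real interface is richer.
interface ReviewResult {
  score: number;
  issues: unknown[];
}

function parseReview(rawText: string): ReviewResult {
  let jsonStr = rawText;
  // Strip a markdown code fence if the model added one despite the prompt.
  const fenced = rawText.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fenced) jsonStr = fenced[1].trim();
  return JSON.parse(jsonStr) as ReviewResult;
}

// Both inputs parse to the same structure:
parseReview('{"score": 72, "issues": []}');                  // bare JSON
parseReview('```json\n{"score": 72, "issues": []}\n```');    // fenced JSON
```

Anything beyond this (say, retrying the model call on a parse failure) is possible but hasn't been necessary in practice when the prompt is doing its job.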

The dual-mode generation pattern

The code reviewer has two generation modes: a synchronous JSON endpoint for the review, and a streaming SSE endpoint for the fixed code.

This split is intentional. The review needs to be structured and complete before the UI can render it — you can't display a partial JSON object as cards. So it's a standard fetch that waits for the full response.

The fix, on the other hand, is streaming-compatible. It's just code — the user can start reading and editing as it streams in. Streaming also creates a better UX: the typing cursor makes it feel responsive rather than like a 3-second wait.

The fix endpoint accepts either a single issue or the full issues array. The single-issue path adds the issue's suggestion to the prompt; the all-issues path enumerates all of them. Same endpoint, different payloads. The streamed output is rendered in a code panel with a copy button.
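From the client's side, the two modes look like this sketch. Endpoint paths and payload shapes are assumptions for illustration, not the app's actual API:

```typescript
// Mode 1: synchronous JSON. The UI renders issue cards only once the
// complete, parseable review object has arrived.
async function fetchReview(code: string): Promise<unknown> {
  const res = await fetch("/api/review", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ code }),
  });
  return res.json();
}

// Mode 2: streaming. Fix code is appended to the panel chunk by chunk
// as it arrives, so the user can start reading immediately.
async function streamFix(code: string, onChunk: (text: string) => void): Promise<void> {
  const res = await fetch("/api/fix", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // The single-issue path would send { code, issue } instead.
    body: JSON.stringify({ code }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    onChunk(decoder.decode(value, { stream: true }));
  }
}
```

The asymmetry is the point: structured output wants an all-or-nothing fetch, while free-form code output is a natural fit for a stream.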

Scoring as a UI anchor

The score (0–100) drives the visual ring indicator — green above 80, yellow above 60, orange above 40, red below. This is a slightly arbitrary but useful anchor: it gives a first impression before reading any issue text.
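The threshold mapping is small enough to be a pure function. A sketch of the logic as described (the color names are mine; the real UI presumably maps them to concrete styles):

```typescript
// Maps a 0-100 review score to the ring indicator color:
// green above 80, yellow above 60, orange above 40, red below.
function ringColor(score: number): "green" | "yellow" | "orange" | "red" {
  if (score > 80) return "green";
  if (score > 60) return "yellow";
  if (score > 40) return "orange";
  return "red";
}
```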

The scoring guide in the system prompt is deliberately simple: 90–100 is excellent, 70–89 is good with minor issues, 50–69 needs improvement, 30–49 has significant issues, below 30 has major problems. Claude applies this consistently enough that the ring color tends to match the severity distribution of the issues.

I've found that users look at the ring first, then check for critical/major badges, then read specific issues. The UI is designed for that triage flow: ring → severity summary → category tabs → individual cards → fix generation.

Where structured output shines

The code review use case is one of the cleanest examples of where structured output pays off. The categories are well-defined (bugs, security, performance, logic, style, docs). The severity levels are well-defined. Line numbers are integers. Suggestions are strings.

Compare this to “summarize this meeting transcript” — where the structure of a good summary isn't obvious, the output might reasonably be prose, and enforcing JSON may actually produce worse results than letting the model write naturally.

Structured output is the right choice when: the output has a known taxonomy, you need to sort or filter results, you need to feed output to another system, or users need to act on specific items rather than read a narrative. Code review hits all four criteria.