laranevans.com
Topics / Prompt Engineering / Structured Outputs

Structured outputs are LLM responses constrained to a predefined shape: JSON matching a schema, a table with named columns, a SQL query, source code in a specified language, a tool call with typed arguments. The constraint matters because downstream systems usually need parseable artifacts, not prose. A model that returns "the temperature in Paris is around 15 degrees" is fine for a user-facing chat surface. A model whose output feeds a database update or a third-party API needs {"city": "Paris", "temp_c": 15} instead.

The technique sits between prompt engineering and system design. Asking nicely for JSON in the system prompt works some of the time. Production reliability requires both a prompt that requests the structure and an enforcement mechanism that catches malformed outputs before they reach the next step.

Three mechanisms (in order of strength)

Provider APIs and open-source toolchains offer three increasingly strict ways to get structured output from a model.

Prompt-only

Ask the model in the system prompt to return JSON matching a schema. Include the schema inline. Include a few example outputs. The model produces JSON-shaped text most of the time, but not always.

This works for low-stakes prototypes and for tasks where occasional malformed output is cheap to retry. It breaks down at scale: even a 99% conformance rate produces ten thousand parse failures per million calls.

JSON mode / response-format constraints

Most modern provider APIs (Anthropic's, OpenAI's, Google's) expose a parameter that constrains output to valid JSON. The decoder is biased toward syntactically valid JSON tokens, and in some implementations hard-constrained. Variants like OpenAI's "structured outputs" (with a JSON Schema parameter) and Anthropic's tool-use shape enforce a specific schema, not only well-formed JSON.

This catches the common failure mode (the model emits a trailing apology that breaks the JSON parser) and is the right default for most production work.

Grammar-constrained decoding

At the decoder level, restrict the next-token sampling distribution to tokens that keep the output grammatically valid against a formal grammar. Open-source tools like Outlines, Guidance, and llama.cpp's grammar feature implement this. The result: the model literally cannot emit a token that would invalidate the schema.

Grammar-constrained decoding is the strongest guarantee available short of post-validation. It works best when the model is locally hosted or the provider exposes a grammar parameter. It is also the most invasive and the most expensive to implement.

When the model goes off-format anyway

Even with JSON mode, a few failure modes persist:

  • The JSON is well-formed but semantically wrong. Required fields are present but contain the wrong type (a string where an integer was expected, an empty list where a non-empty list was required).
  • Hallucinated values inside valid structure. The schema validates, but the data is fictional. A {"price_usd": 47.99} for a product that does not exist still serializes correctly.
  • Truncation under output-token limits. A long output gets cut off mid-array. The first part of the JSON is valid up to the truncation point and then breaks.
  • The model wraps the JSON in prose. With prompt-only enforcement, the model sometimes returns Sure, here is the data: {"city": "Paris"}. The prefix breaks naive parsers.

Defenses are the same across cases: validate the parsed JSON against a schema (Pydantic, Zod, JSON Schema), retry on validation failure with the validator's error message in the next prompt, cap retries, and surface the failure as structured error data when the cap is hit.

When structured output is the wrong tool

Structured output is appropriate when the task has a clear shape and the consumer is software. It is the wrong tool when:

  • The task requires natural-language reasoning that does not fit neatly into a schema. Forcing a free-form answer into a {"reasoning": "...", "answer": "..."} shape degrades quality, because the model spends attention on the wrapping instead of the substance.
  • The downstream consumer handles prose directly. A user-facing chat assistant does not need JSON.
  • The work is exploratory and the schema is not yet known. Structured outputs are most useful when the schema is stable. Iterating on the schema as the prompt iterates is a sign the task is not ready for this layer.

Structured outputs and tool calling

Tool calling is structured output in a specific shape: an assistant message containing a tool name and JSON arguments matching the tool's parameter schema. Provider APIs treat tool calling as a first-class message type, which means the structure is enforced at the API boundary rather than at the application layer.

A useful pattern: if the task's output naturally maps to "call this function with these arguments," prefer the tool-call surface over a JSON-mode response. The tool-call message type carries a tool-use ID, error semantics, and a clean conversation shape for the result. JSON mode is the right answer when the structure is the output and there is no function to call.

Structured outputs and program-aided generation

For tasks that require exact computation, code generation is a stronger structured-output target than data. Program-Aided Language Models (Gao et al. 2022) showed that asking a model to generate Python and delegating execution to an interpreter outperforms asking the model to compute the answer in its head.

When the structure of the output is "a value that needs to be computed," the right form is often "a program that computes the value, run by your harness." The model is good at writing code that computes things and unreliable at being the calculator.

Practical guidance

A few rules of thumb:

  • Default to JSON mode (or the provider's equivalent schema-enforcing parameter) for any output that downstream software consumes.
  • Always validate against a schema after parsing. Treat the model's JSON mode as a strong hint, not a guarantee.
  • Cap retries. Three attempts that all fail validation usually means the schema is wrong or the task is misspecified, not that the fourth attempt will succeed.
  • For exact computation, prefer tool calls or PAL over a JSON value the model produced through reasoning.
  • Keep the schema small. Models follow tight schemas more reliably than sprawling ones with many optional fields.
  • Surface validation failures as structured error data, not as a generic "the model failed" message. Downstream systems benefit from knowing which field broke.