Prompt Engineering
Prompt engineering is the discipline of shaping the text input to a language model so its output is more useful, more reliable, and better aligned with the task. At its core, the work is to treat the prompt as a working interface (something to iterate on, measure against, and version) rather than as a one-shot guess.
What it spans
Prompt engineering shows up at several layers of an LLM-using system:
- System prompts — the persistent layer that sets role, format, and constraints across a conversation or workload.
- User prompts — task framing, anchors, examples, and the specific request itself.
- Techniques — patterns like chain of thought, few-shot examples, structured outputs, tool-use specifications, and the agentic patterns that compose them.
- Composition — when prompts are reused, layered, or programmatically assembled (templates, prompt programs, prompt-as-code). At this layer prompts stop being "a string of text" and become an artifact with versioning, ownership, and review.
What makes a prompt work
A prompt is doing its job when it produces the desired output reliably across the inputs the system will actually see. What "good" looks like depends on the task: instructions that excel for code review can be wrong for summarization.
The levers are concrete:
- Task instructions — what the model is being asked to do, expressed clearly and unambiguously.
- In-context examples — demonstrations of the desired input-output mapping rather than descriptions of it. See few-shot prompting.
- Providing examples is something people tend to do when asked to provide an example. I imagine an hypothetical dialog where Person A makes a statement and Person B says: "I don't understand. Can you give me an example?". It's not something we tend to do extensively until we know the person we're interacting with fairly well. If I know the person I'm speaking with is an avid baseball fan, then I may use baseball analogies to teach new concepts in other domains simply because I know baseball terms and patterns are familiar to the person I'm speaking with.
- Examples work because they show the desired pattern instead of describing it. A description leaves room for interpretation; an example demonstrates what counts. Examples also set format, register, and level of detail implicitly, without requiring the prompt to spell each one out. They help most when the task is ambiguous, when the output needs a specific shape, or when the boundary cases are hard to describe in words. For well-understood, unambiguous tasks, prompts without examples can perform fine; the value of examples scales with how much disambiguation the task needs.
- Format constraints — the shape the output must take, often via structured outputs or schema validation. LLMs can output structured JSON, CSV, code, or prose all equally well. They just need to be told what format is desired.
- Capability invocation — explicit use of what the surrounding system can do: tool calls, sub-agents, skills, retrieval. A prompt written without regard for the host environment (Claude, Cursor, Codex, etc.) is leaving leverage on the table.
A poorly working prompt usually fails at one of these: ambiguous instructions, missing or misleading examples, no enforced output shape, or unused capabilities of the surrounding system.
Underneath the four levers is a skill that doesn't appear in the prompt text: knowing the model. The same instruction can land differently across models because models differ in what they do without being asked, where their attention reliably falls, and which kinds of inputs throw them off. Prompts written for a specific model's strengths outperform prompts written in general terms. Model awareness doesn't make outcomes deterministic, but it does shape which levers a prompt author reaches for and how hard to lean on each.
How prompts fail in production
Production prompts fail in patterned ways. The five patterns below aren't a canonical list (different practitioners would slice the space differently), but they show up often enough to organize around. Two of them — the brittleness pattern and the prompt-injection pattern — have a published academic and security canon, cited inline below. The others are practitioner consensus distilled from engineering blogs, provider docs, and shared experience.
Small phrasing changes have outsized effects. Wording, ordering, whitespace, and example selection can produce disproportionately large changes in output. Lu et al. (2022) showed that example order alone can swing few-shot accuracy from near state-of-the-art to near-random across model sizes, and Sclar et al. (2023) found accuracy swings of up to 76 points across semantically equivalent format variations on LLaMA-2 13B. A prompt that performs well on a small set of hand-picked inputs can regress on the messier inputs real users send. The correction is to evaluate against samples drawn from real usage rather than a small hand-picked set the author already knows by heart.
That phrasing sensitivity compounds when the model changes underneath the prompt. Prompts don't transfer cleanly across models. A prompt tuned for one model rarely works the same on another. The order-sensitivity work cited above also reports that a good prompt configuration for one model often fails to transfer to others; engineering practice reports the same pattern at the broader level of full prompts. Migration between providers (or even between minor versions from the same provider) is a real cost. Prompts that rely on a model's specific capabilities (tool use, sub-agents, skills, extended thinking) won't transfer at all to models that don't support those capabilities. Re-evaluate prompts when changing models, with attention to what capabilities the new model offers and how it tends to behave. That understanding determines which existing prompts survive and which need rebuilding.
Even on a stable model, more isn't better. Simple, focused prompts tend to outperform complex ones. Beyond a threshold, adding more instructions, more examples, or more guardrails makes outputs worse, not better. This is practitioner consensus from engineering blogs and provider docs rather than a single academic finding; three mechanisms recur in those reports. The model spends attention parsing rules instead of doing the task; some rules contradict each other; some rules suppress behaviors that were already correct. Subtractive editing (removing the instruction that no longer earns its keep) is often higher leverage than adding the next clarification. Removing a rule reveals whether it was actually doing work, and rules that "fix" a rare edge case often suppress behaviors that were correct on the common case.
The other source of variability isn't the prompt; it's the input. Real inputs differ from evaluation inputs. Production traffic includes weirder phrasing, longer text, malformed structure, and out-of-distribution requests that no curated evaluation set captures. The phenomenon is the prompting-specific case of distribution shift, a long-standing concern in machine learning. The reliable correction is to sample real production inputs back into the evaluation set on a regular cadence and re-test prompt changes against that drift.
The final failure shifts the threat axis from variability to trust. Untrusted content can carry instructions. Anything in the prompt context that isn't fully controlled — retrieved documents, tool outputs, user-supplied content — can include adversarial instructions aimed at the model. The class of attack was named "prompt injection" by Simon Willison in September 2022, formalized for retrieval and tool-integrated systems by Greshake et al. (2023), and now sits at the top of the OWASP Top 10 for LLM Applications as LLM01.
[!tip]+ Treat untrusted content as data, never as instructions to help secure systems against prompt injection.
Each of these failures has the same antidote underneath: a way to tell whether a prompt change actually improved results. Without an evaluation method, prompt engineering reduces to taste, and taste does not survive contact with real production traffic. That machinery lives at prompt evaluation.
Atomic pages
- chain of thought — telling the model to "think step by step": when it helps, when it doesn't, what's replaced it
- few-shot prompting — providing examples in-context: selection, ordering, and the limits of in-context learning
- system prompts — the persistent role-and-constraint layer and how it composes with the user turn
- structured outputs — constraining responses to a schema: JSON mode, tool calls, grammar decoding
- prompt evaluation — measuring whether a prompt change is real or noise
- prompt versioning — treating prompts as code: version control, code review, rollback
- prompt injection — adversarial inputs and the defenses that actually hold up