Few-Shot Prompting
Few-shot prompting is the practice of including a small number of input-output examples directly in the prompt to guide the model's behavior on a new input. The model is not retrained on the examples; it sees them as part of the prompt context and uses them to infer the task. The practice is also called in-context learning, the term Brown et al. (2020) used when GPT-3 demonstrated that sufficiently large language models could perform competitively on many NLP tasks from a handful of examples alone, without any gradient updates.
The standard naming convention reflects the example count: zero-shot (no examples, instruction only), one-shot (one example), few-shot (typically two to twenty). Above that count, the practice merges with retrieval-augmented patterns where examples are selected dynamically rather than fixed in the prompt.
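As a minimal sketch of the naming convention, the three variants differ only in how many worked examples are placed before the new input. The `build_prompt` helper, the `Input:`/`Output:` layout, and the sample reviews are illustrative assumptions, not a standard API or format.

```python
def build_prompt(instruction, examples, query):
    """Assemble an instruction, k worked examples, and the new input into one prompt."""
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Input: {text}\nOutput: {label}")
    # The query uses the same layout, with the output left open for the model.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


instruction = "Classify the sentiment of each input as positive or negative."
examples = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
]
query = "Setup took five minutes."

zero_shot = build_prompt(instruction, [], query)           # instruction only
one_shot = build_prompt(instruction, examples[:1], query)  # one example
few_shot = build_prompt(instruction, examples, query)      # two examples
print(few_shot)
```

None of the three variants changes the model; only the prompt text differs.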
What few-shot prompting actually does
The most intuitive story is that examples teach the model the task: show it the input-output mapping, and it generalizes the pattern. That story is partly right and partly misleading.
It is right in the sense that demonstrations make the desired pattern concrete in a way that descriptions cannot. A description leaves room for interpretation; an example demonstrates what counts. Examples also set format, register, and level of detail implicitly, without requiring the prompt to spell each one out.
It is misleading in the sense that the model is not "learning" the task from the examples in the way that word usually implies. Webson and Pavlick (2022) found that models often learn about as fast from prompts that are intentionally irrelevant or even pathologically misleading as they do from instructively good ones, and that performance depends more on the choice of target words (the output labels) than on the meaning of the prompt text itself. On this reading, pre-training already established the model's capability; the examples primarily activate the right capability, in the right format, at the right granularity. That framing is a better guide to choosing examples than the simple show-don't-tell story.
When few-shot earns its keep
Few-shot helps most when the task has properties that pure instruction-following handles poorly:
- Ambiguous tasks. When the instruction admits multiple reasonable interpretations, examples disambiguate by showing which interpretation the author wants. "Summarize this article" can mean a one-sentence headline or a three-paragraph abstract; an example fixes the answer.
- Format-constrained tasks. When the output must follow a specific shape (a particular JSON schema, a specific markdown structure, a fixed prose pattern), examples set the shape more reliably than descriptions of the shape. This overlaps with structured outputs but doesn't replace it; the strongest format guarantees still come from constrained decoding or schema validation, not from in-prompt examples alone.
- Boundary cases the instruction can't describe. Some distinctions are easier to demonstrate than to articulate. "Treat sarcastic praise as negative sentiment" is a rule; the example "The interface is so intuitive I needed a tutorial to find the save button", labeled negative, teaches the same rule by demonstration and can cover cases the rule didn't anticipate.
- Novel or unusual tasks. When the task sits outside common training distributions, examples ground the model in what the author actually wants rather than the closest familiar task the model would otherwise default to.
For well-understood, unambiguous tasks with no special format requirements, zero-shot is often fine and sometimes better. Adding examples to a task the model already handles well can introduce noise, bias the output toward the examples' specific phrasing, or distract the model from instructions that were working.
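For the format-constrained case above, in-prompt examples set the shape but do not guarantee it, so a validation step after generation is a common complement. The sketch below assumes a hypothetical two-field schema (`sentiment`, `confidence`) and uses only the standard-library `json` module; a production system might use a schema-validation library or constrained decoding instead.

```python
import json

# Hypothetical schema for illustration: field name -> accepted Python type(s).
REQUIRED_KEYS = {"sentiment": str, "confidence": (int, float)}


def parse_output(raw):
    """Return the parsed dict if raw matches the expected shape, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Reject non-objects and objects with missing or extra keys.
    if not isinstance(data, dict) or set(data) != set(REQUIRED_KEYS):
        return None
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(data[key], expected_type):
            return None
    return data


good = parse_output('{"sentiment": "positive", "confidence": 0.9}')
chatty = parse_output('Sure! Here is the JSON: {"sentiment": "positive"}')
print(good, chatty)
```

A `None` result can trigger a retry or a fallback, which keeps malformed output from reaching downstream code.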
How to choose examples
The choice of examples is the load-bearing part of few-shot prompting. The same examples in a different order can produce wildly different results, and the wrong selection can be worse than no examples at all.
Count. The naive intuition is that more examples are always better. The actual relationship is task-dependent and tops out fast: for many classification and short-form generation tasks, performance plateaus by five to ten examples and can degrade beyond that. For complex reasoning tasks, more examples sometimes help by giving the model more variations of the desired reasoning pattern, but the marginal gain shrinks quickly. Start with a small count (three to five) and add only when evaluation shows it helps.
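One way to act on this is to measure accuracy as a function of example count on a held-out set rather than guessing. The sketch below is illustrative only: `accuracy_by_count` and `format_prompt` are hypothetical helpers, and `toy_model` is a word-overlap stand-in so the code runs offline. In practice `model` would wrap a real model call.

```python
def format_prompt(examples, query):
    shots = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    return "\n\n".join(shots + [f"Input: {query}\nOutput:"])


def accuracy_by_count(model, examples, eval_set, counts):
    """Map each example count k to accuracy on eval_set using the first k examples."""
    results = {}
    for k in counts:
        correct = sum(
            model(format_prompt(examples[:k], x)).strip() == y
            for x, y in eval_set
        )
        results[k] = correct / len(eval_set)
    return results


def toy_model(prompt):
    """Stand-in model: copy the label of the example sharing the most words with the query."""
    *shots, query = prompt.split("\n\n")
    query_words = set(query.lower().split())
    best_label, best_overlap = "negative", 0
    for shot in shots:
        text, label = shot.split("\nOutput: ")
        overlap = len(set(text.lower().split()) & query_words)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label


examples = [
    ("Slow and buggy.", "negative"),
    ("Works great.", "positive"),
    ("Broke quickly.", "negative"),
]
eval_set = [("Really great.", "positive"), ("Broke again.", "negative")]

results = accuracy_by_count(toy_model, examples, eval_set, counts=[0, 1, 2, 3])
print(results)
```

With a real model and a larger evaluation set, the count to keep is the smallest one after which the curve stops improving.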
Selection. Examples should cover the space of inputs the model will actually see, not the space the author finds easy to imagine. Real production inputs often differ from cases an author can mentally enumerate; sampling examples from actual usage tends to outperform writing them from scratch. When examples are written by hand, deliberately include cases that look slightly different from each other rather than rephrasings of the same case. Diversity in the examples shows the model what dimensions are allowed to vary.
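A rough sketch of picking diverse examples from a pool of logged inputs: greedy farthest-point selection, here with word-set Jaccard distance as a cheap stand-in for whatever similarity measure a team actually uses (embedding distance is a common choice). The function names and sample inputs are assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """1 minus word-set overlap; 0.0 means identical vocabularies."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return 1.0 - len(wa & wb) / len(wa | wb)


def pick_diverse(candidates, k):
    """Greedily pick k inputs, each maximizing its distance to those already chosen."""
    if k <= 0 or not candidates:
        return []
    chosen = [candidates[0]]
    remaining = list(candidates[1:])
    while remaining and len(chosen) < k:
        best = max(
            remaining,
            key=lambda c: min(jaccard_distance(c, s) for s in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


logged_inputs = [
    "refund my order please",
    "please refund my order",  # rephrasing of the first input
    "app crashes on login",
    "how do I change my email",
]
picked = pick_diverse(logged_inputs, 3)
print(picked)
```

The rephrased duplicate is skipped because it adds no new dimension of variation. The selected inputs still need correct outputs attached before they can serve as examples.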
Order. Order matters more than most authors expect. Lu et al. (2022) showed that the same set of examples in a different permutation can swing few-shot accuracy from near state-of-the-art to near-random, and that the effect persists across model sizes. The paper proposes an entropy-based heuristic for ordering; in practice, most teams treat order as a hyperparameter to evaluate rather than something to optimize analytically.
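Treating order as a hyperparameter can be as simple as scoring several permutations on a development set. The sketch below is not the entropy-based method from the paper; it is a plain brute-force comparison. `rank_orders` and `format_prompt` are hypothetical helpers, and `recency_model` is a toy stand-in that always copies the label of the last example, chosen to make the order effect visible offline.

```python
from itertools import permutations


def format_prompt(examples, query):
    shots = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    return "\n\n".join(shots + [f"Input: {query}\nOutput:"])


def rank_orders(model, examples, dev_set, max_orders=24):
    """Score up to max_orders permutations on dev_set; return (accuracy, order) pairs, best first."""
    scored = []
    for i, order in enumerate(permutations(examples)):
        if i >= max_orders:
            break
        correct = sum(
            model(format_prompt(order, x)).strip() == y for x, y in dev_set
        )
        scored.append((correct / len(dev_set), list(order)))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def recency_model(prompt):
    """Toy stand-in: return the label of the last example in the prompt."""
    pieces = prompt.rsplit("Output: ", 1)
    return pieces[1].split("\n", 1)[0] if len(pieces) == 2 else ""


examples = [("Works great.", "positive"), ("Broke quickly.", "negative")]
dev_set = [("Terrible.", "negative"), ("Awful support.", "negative")]

ranked = rank_orders(recency_model, examples, dev_set)
best_accuracy, best_order = ranked[0]
print(best_accuracy, best_order)
```

Because the number of permutations grows factorially, `max_orders` caps the search. An order chosen this way should be confirmed on data that was not used to pick it.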
Format consistency. Sclar et al. (2023) found accuracy swings of up to 76 points on LLaMA-2 13B from semantically equivalent format variations: different separator characters, whitespace, label punctuation, and so on. The same examples laid out two different ways can produce two different model behaviors. Pick a format and apply it uniformly across examples; do not let the format vary between the demonstrated examples and the actual query.
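One way to keep the format uniform is to render every example and the final query through a single template, so separators, whitespace, and label punctuation cannot diverge. The template text and helper names below are illustrative assumptions.

```python
SEPARATOR = "\n\n"
TEMPLATE = "Review: {text}\nSentiment:"


def render(text, label=None):
    """Render one block; examples pass a label, the query leaves it open."""
    block = TEMPLATE.format(text=text.strip())
    return block if label is None else f"{block} {label}"


def assemble(examples, query):
    """Build the full prompt from the same template for examples and query."""
    blocks = [render(text, label) for text, label in examples]
    blocks.append(render(query))
    return SEPARATOR.join(blocks)


prompt = assemble(
    [("Great value.", "positive"), ("  Stopped working.  ", "negative")],
    "Arrived on time.",
)
print(prompt)
```

Routing both code paths through `render` means a format change happens in one place and applies to examples and query together.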
How few-shot prompting fails
Two of the most consequential failure modes were just named: order sensitivity and format sensitivity. A few others are worth watching for.
Distributional bias in examples. If the examples skew toward one class, one length, or one phrasing pattern, the model often produces more outputs in that mold than the underlying data would justify. A sentiment-classification prompt with five positive examples and one negative tends to overpredict positive on ambiguous cases. Balance examples deliberately whenever the label distribution matters to the task.
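A quick check for label skew, sketched with the standard-library `Counter`. The `balance` helper is a hypothetical example that downsamples to a per-label cap; adding examples for the minority class is an equally reasonable fix.

```python
from collections import Counter, defaultdict


def label_counts(examples):
    """Count how many examples carry each label."""
    return Counter(label for _, label in examples)


def balance(examples, per_label):
    """Keep at most per_label examples for each label, preserving original order."""
    kept, seen = [], defaultdict(int)
    for text, label in examples:
        if seen[label] < per_label:
            kept.append((text, label))
            seen[label] += 1
    return kept


examples = [
    ("Love it.", "positive"),
    ("Works great.", "positive"),
    ("Fast shipping.", "positive"),
    ("Broke in a week.", "negative"),
    ("Very comfortable.", "positive"),
    ("Exactly as described.", "positive"),
]

print(label_counts(examples))
balanced = balance(examples, per_label=1)
print(label_counts(balanced))
```

Running the count before every prompt change makes skew visible early, before it shows up as biased predictions.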
Memorization rather than generalization. When examples are highly similar to the production input, the model can produce outputs that pattern-match the example rather than reason from the input. The output looks correct because it resembles a real example; the reasoning isn't actually happening. This is more common with smaller models and with examples that share surface features with the query.
Format drift between examples and query. If the demonstrated examples use one format and the actual query uses a different one, the model's behavior becomes unpredictable. This is a subset of the format-sensitivity problem and is one of the easiest to introduce accidentally when programmatic prompt assembly stitches examples and query from different code paths.
Static examples on a drifting input distribution. Examples chosen well for last quarter's inputs may not generalize to this quarter's. When the input distribution shifts, the examples can become misleading rather than informative, pointing the model at a task that no longer matches what users are sending.
Few-shot, zero-shot, and the alternatives
Few-shot is one technique in a broader space. The decision of which to use is mostly empirical, and prompt evaluation is what makes that decision principled rather than guessed.
- Zero-shot wins when the task is unambiguous, the format is unconstrained or easily described, and the model is large enough to handle the task on its own. It is also faster and cheaper, because the prompt contains fewer tokens.
- Few-shot wins when the task needs disambiguation, has format requirements that demonstration handles better than description, or sits outside the model's training-time defaults.
- Dynamic few-shot (retrieval-augmented example selection) wins when no small fixed example set covers the input distribution. Examples are selected at runtime based on similarity to the current input, so different queries see different example sets. This handles distribution shift better than static few-shot but adds complexity in the retrieval pipeline.
- Fine-tuning wins when the task is stable enough to justify a model update, when enough labeled examples are available to make the update meaningful, and when latency or cost constraints rule out long prompts. It does not eliminate the role of evaluation, which still determines whether the fine-tuned model is actually better than the prompted one.
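The dynamic few-shot option in the list above can be sketched as a nearest-neighbor lookup over an example pool. This version uses bag-of-words cosine similarity so it runs with the standard library alone. Real systems typically use embedding vectors and a vector index, and the function names and sample pool here are assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_examples(pool, query, k):
    """Return the k pool examples whose inputs are most similar to the query."""
    q = vectorize(query)
    ranked = sorted(pool, key=lambda ex: cosine(vectorize(ex[0]), q), reverse=True)
    return ranked[:k]


pool = [
    ("refund for a damaged order", "billing"),
    ("app crashes on login", "bug"),
    ("cannot log in to the app", "bug"),
    ("change my shipping address", "account"),
]
selected = select_examples(pool, "the app crashes when I open it", k=2)
print(selected)
```

Each incoming query gets its own example set, which is what lets this pattern follow a shifting input distribution. The cost is a retrieval step that has to be built, monitored, and evaluated like any other component.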
Few-shot is rarely the long-term answer for a production system. It is, however, almost always a good first answer: fast to set up, easy to iterate, and informative about what the task actually needs even if the eventual implementation moves to something else.