Chain of Thought
Chain-of-thought prompting asks a language model to produce intermediate reasoning steps before its final answer. The technique was introduced by Wei et al. (2022), who showed that including exemplars with rationales (not only input-output pairs) substantially improved arithmetic, commonsense, and symbolic reasoning on sufficiently large models. The effect emerged with scale: smaller models gained little or nothing from the same prompting style, and in some cases did worse.
The technique became one of the most widely cited prompting patterns in 2022 and 2023. Variants and successors include zero-shot CoT, self-consistency, least-to-most prompting, and Tree of Thoughts. Each one addresses a different limitation in the base CoT recipe.
The two forms
Two recipes are commonly called "chain of thought." They behave differently and warrant separate treatment.
Few-shot CoT (Wei et al. 2022)
The original recipe. The prompt includes several worked examples, each consisting of a problem, a step-by-step rationale, and an answer. The model is expected to imitate the structure on a new problem.
This form benefits from carefully chosen exemplars. The rationales need to be correct, consistent in format, and representative of the kinds of problems the system will see in production. The original Wei et al. paper documents that exemplar quality and consistency materially affect the technique's gains.
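The exemplar structure described above can be sketched as a small prompt builder. This is a minimal sketch, not the paper's code: the `Exemplar` class, the `build_few_shot_cot_prompt` function, the "Q:/A:" layout, the "The answer is X." suffix, and the two exemplars are all illustrative assumptions. The point is only that every exemplar passes through one template, which keeps the format consistent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    """One worked example (hypothetical structure for illustration)."""
    question: str
    rationale: str  # step-by-step working, written by the prompt author
    answer: str


def build_few_shot_cot_prompt(exemplars: list[Exemplar], question: str) -> str:
    """Render every exemplar with one fixed layout, then the new question."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.rationale} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The trailing "A:" leaves the rationale and the answer for the model.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Illustrative exemplars, written for this sketch.
EXEMPLARS = [
    Exemplar(
        question="A shelf holds 3 boxes of 4 pens. 5 loose pens are added. How many pens?",
        rationale="3 boxes of 4 pens is 3 * 4 = 12 pens. Adding 5 gives 12 + 5 = 17.",
        answer="17",
    ),
    Exemplar(
        question="A train has 8 cars with 10 seats each. 6 seats are broken. How many work?",
        rationale="8 cars of 10 seats is 8 * 10 = 80 seats. Removing 6 gives 80 - 6 = 74.",
        answer="74",
    ),
]

prompt = build_few_shot_cot_prompt(
    EXEMPLARS, "A crate holds 6 trays of 12 eggs. 9 eggs break. How many are left?"
)
print(prompt)
```

Keeping the exemplars as data rather than as a hand-edited string makes it easier to review each rationale for correctness and to swap in exemplars that better match production traffic.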
Zero-shot CoT (Kojima et al. 2022)
A single instruction ("Let's think step by step.") appended to the user's question. Kojima et al. (2022) showed this cue alone elicits multi-step reasoning on many tasks without hand-crafted exemplars.
Zero-shot CoT is cheaper to maintain (no exemplar curation) and easier to apply across diverse tasks. The trade-off is variance: the model's reasoning structure is less constrained, and outputs are harder to parse downstream.
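A minimal sketch of the zero-shot recipe and of the parsing problem it creates. The function names, the "Q:/A:" layout, and the sample completion are assumptions made for illustration, and no model is called. The extractor is a deliberately naive heuristic (take the last number in the text), which shows why loosely structured reasoning is harder to parse than a fixed exemplar format.

```python
import re
from typing import Optional

COT_CUE = "Let's think step by step."


def build_zero_shot_cot_prompt(question: str) -> str:
    """Place the reasoning cue after the question (layout is an assumption)."""
    return f"Q: {question}\nA: {COT_CUE}"


def extract_final_number(completion: str) -> Optional[str]:
    """Naive heuristic: return the last number in the completion, if any.

    Thousands separators are stripped first. This can misfire when the
    model mentions another number after its answer.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None


# Hypothetical completion, written by hand for this sketch.
sample_completion = (
    "There are 3 boxes with 4 pens each, so 3 * 4 = 12 pens. "
    "Adding 5 gives 12 + 5 = 17. The answer is 17."
)

print(build_zero_shot_cot_prompt("How many pens are there in total?"))
print(extract_final_number(sample_completion))
```

In practice a second instruction that asks for the answer in a fixed format tends to be more robust than post-hoc pattern matching like this.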
Modern instruction-tuned models often produce step-by-step reasoning when the task warrants it without any explicit cue. The cue still helps on borderline cases and on tasks where the model's default is to answer directly.
When CoT helps
The original paper and subsequent benchmarks suggest CoT helps when the task:
- Requires multiple inference steps to reach the answer.
- Has intermediate quantities, sub-results, or named entities the model needs to track.
- Benefits from the model exposing its working, so that an error in one step surfaces instead of silently corrupting the final answer.
- Runs on a model capable enough to maintain coherent reasoning across the steps. The emergent-scale finding from Wei et al. (2022) still applies: very small models can do worse with CoT than without.
Math word problems, multi-hop question answering, planning tasks, and symbolic manipulation are the canonical fits.
When CoT does not help (or hurts)
CoT is not universally beneficial. A few patterns are worth remembering:
- Single-step or pattern-matching tasks. Classification, simple extraction, and short transformations gain little. The extra tokens cost latency and money without improving accuracy.
- Tasks the model solves without working. A model that already knows the answer to a trivia question does not benefit from being asked to derive it. The derivation often introduces errors that did not exist in the direct answer.
- Tasks where the failure mode is knowledge, not reasoning. CoT does not manufacture facts the model lacks. Adding a reasoning scaffold to a knowledge-gap problem produces fluent-but-wrong rationales. The fix is retrieval, not reasoning.
- Tasks with strict latency or token budgets. CoT can inflate output length several-fold, and sometimes by an order of magnitude, compared to a direct answer, since the model emits the reasoning chain before the conclusion. On a 30-second user-facing budget, the cost often outweighs the accuracy gain.
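The latency point in the last item can be made concrete with back-of-the-envelope arithmetic. Every number below is a hypothetical assumption chosen for illustration, not a measurement: a 20-token direct answer, a 400-token chain of thought, and a decoding speed of 40 tokens per second. The estimate also ignores prompt processing and network time.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Rough decode time: output tokens divided by decoding speed."""
    return output_tokens / tokens_per_second


# All three values are assumptions for illustration only.
DIRECT_TOKENS = 20
COT_TOKENS = 400
TOKENS_PER_SECOND = 40.0

direct_seconds = generation_seconds(DIRECT_TOKENS, TOKENS_PER_SECOND)
cot_seconds = generation_seconds(COT_TOKENS, TOKENS_PER_SECOND)

print(f"direct answer: {direct_seconds:.1f} s")  # 0.5 s
print(f"with CoT:      {cot_seconds:.1f} s")  # 10.0 s
print(f"ratio:         {cot_seconds / direct_seconds:.0f}x")  # 20x
```

Under these assumptions the chain of thought alone consumes a third of a 30-second budget, before any retries or tool calls.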
The faithfulness problem
The most important caveat to CoT comes from Turpin et al. (2023), in a paper titled "Language Models Don't Always Say What They Think". The paper shows that chain-of-thought rationales do not always reflect the true cause of a model's prediction. The model produces a fluent rationale, but the actual decision pathway runs elsewhere.
Three implications follow:
- A confident rationale is not evidence the answer is correct. Plausible-sounding reasoning can rationalize incorrect or biased predictions.
- CoT does not give you interpretability on a silver platter. Reading the rationale tells you what the model says it did, not what the model actually did.
- Auditing CoT outputs needs ground-truth checks. Symbolic validators (a Python interpreter, a SAT solver, a database), unit tests, retrieval-source comparison, or human review all matter more than re-reading the rationale.
This does not mean CoT is bad. It means CoT is a performance technique with a misleading side-effect: outputs that look more inspectable than they are.
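As one example of the symbolic validators mentioned above, the sketch below recomputes every "a op b = c" statement found in a rationale instead of trusting the text. The function name, the regular expression, and the sample rationale are assumptions made for illustration. A clean result only means the stated arithmetic is right; it does not show that the rationale is what actually drove the model's answer.

```python
import operator
import re

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

NUMBER = r"-?\d+(?:\.\d+)?"
# Matches simple binary statements such as "12 + 5 = 17".
STEP = re.compile(rf"({NUMBER})\s*([+\-*/])\s*({NUMBER})\s*=\s*({NUMBER})")


def find_bad_steps(rationale: str, tol: float = 1e-9) -> list[str]:
    """Return each arithmetic statement whose claimed result is wrong."""
    bad = []
    for match in STEP.finditer(rationale):
        left, op, right, claimed = match.groups()
        if op == "/" and float(right) == 0.0:
            bad.append(match.group(0))  # division by zero is never valid
            continue
        actual = OPS[op](float(left), float(right))
        if abs(actual - float(claimed)) > tol:
            bad.append(match.group(0))
    return bad


# Hypothetical rationale with a deliberate slip in the second step.
rationale = (
    "Each box holds 4 pens, so 3 * 4 = 12. "
    "Adding 5 gives 12 + 5 = 18. The answer is 18."
)
print(find_bad_steps(rationale))  # ['12 + 5 = 18']
```

The rationale above reads fluently and ends with a confident answer, yet the recomputation flags the second step, which is the kind of error that re-reading alone tends to miss.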
Successors and complements
Several patterns build on CoT and address its limitations:
- Self-consistency (Wang et al. 2022). Sample multiple reasoning chains, return the majority answer. Improves accuracy on math and commonsense benchmarks at the cost of N× as many LLM calls.
- Least-to-most prompting (Zhou et al. 2022). Decompose a hard problem into easier sub-problems first, then solve in order. Improves easy-to-hard generalization.
- Tree of Thoughts (Yao et al. 2023). Generalize CoT into search over candidate intermediate states with self-evaluation and backtracking. Useful for planning and puzzle tasks.
- Program-Aided Language Models (Gao et al. 2022). Generate a runnable program as the reasoning trace, delegate computation to an interpreter. Sidesteps arithmetic slips in the computation itself, although the generated program can still encode the wrong logic.
The pattern across all four is the same: take CoT's "make the model show its work" idea and structure the work so it is cross-checked, decomposed, searched, or executed.
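Of the four, self-consistency is the simplest to sketch. The code below is a minimal illustration of the majority-vote idea under stated assumptions: `self_consistent_answer`, `fake_sample`, and `after_answer_tag` are hypothetical names, the sampler is a stub that replays canned completions instead of calling a model, and the "Answer:" tag is an assumed output convention.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    sample_chain: Callable[[], str],
    extract_answer: Callable[[str], Optional[str]],
    n_samples: int = 5,
) -> Optional[str]:
    """Sample n reasoning chains and return the most common final answer."""
    answers = []
    for _ in range(n_samples):
        answer = extract_answer(sample_chain())
        if answer is not None:  # skip chains with no parseable answer
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Stub sampler: replays canned completions in place of a real model call.
_canned = iter([
    "3 * 4 = 12, plus 5 is 17. Answer: 17",
    "3 + 4 = 7, plus 5 is 12. Answer: 12",
    "12 pens plus 5 pens. Answer: 17",
])


def fake_sample() -> str:
    return next(_canned)


def after_answer_tag(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' tag, or None if absent."""
    _, sep, tail = completion.rpartition("Answer:")
    return tail.strip() if sep else None


result = self_consistent_answer(fake_sample, after_answer_tag, n_samples=3)
print(result)  # 17
```

With a real model, `sample_chain` would issue one sampled completion per call at a non-zero temperature, which is where the N-fold cost comes from.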
Practical guidance
A few rules of thumb:
- Try the task without CoT first on a recent instruction-tuned model. Modern instruction-tuned models already reason when warranted.
- If CoT helps, decide whether the rationale needs to be visible to the user. Often the right architecture is "reason internally, return only the answer" via a system-prompt instruction or via the model's reasoning-tokens feature where supported.
- For exact arithmetic, symbolic logic, or executable answers, prefer PAL or tool use over pure CoT. The reasoning trace then has a checkable artifact.
- Treat the rationale as a debugging aid, not as proof. Combine with evaluation against ground truth.
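The preference for PAL or tool use over pure CoT can be sketched as follows. This is an illustration of the idea, not the PAL authors' implementation: `GENERATED_PROGRAM` is a hand-written stand-in for model output, and the convention that the program stores its result in a variable named `answer` is an assumption. The interpreter does the arithmetic, so the result can be compared against ground truth.

```python
# Stand-in for a program the model would generate (written by hand here).
GENERATED_PROGRAM = """
boxes = 3
pens_per_box = 4
extra_pens = 5
answer = boxes * pens_per_box + extra_pens
"""


def run_generated_program(source: str) -> object:
    """Execute the program and return the value it bound to `answer`.

    Emptying __builtins__ only limits what simple code can reach. It is
    NOT a security sandbox; untrusted model output belongs in an isolated
    process or container.
    """
    namespace: dict = {"__builtins__": {}}
    exec(source, namespace)
    if "answer" not in namespace:
        raise ValueError("program did not define `answer`")
    return namespace["answer"]


result = run_generated_program(GENERATED_PROGRAM)
print(result)  # 17

# The checkable artifact: compare the executed result with a known value.
EXPECTED = 17  # ground truth for this illustrative problem
print(result == EXPECTED)  # True
```

The program can still encode the wrong logic, so the final comparison against a known value remains the actual check.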
Related
- Prompt Engineering — the broader cluster CoT sits inside.
- Few-Shot Prompting — the exemplar mechanism CoT builds on.
- Prompt Evaluation — how to tell whether CoT actually helps your task.