laranevans.com
Topics / Prompt Engineering / Tree of Thoughts

Tree of Thoughts generalizes chain-of-thought into a deliberate search over candidate intermediate states. The model produces multiple candidate "thoughts" at each step, evaluates them with self-judgment, and chooses which branch to expand. Yao et al. (2023), in Tree of Thoughts: Deliberate Problem Solving with Large Language Models, introduced the framework and showed it outperformed CoT on tasks requiring planning, lookahead, or backtracking.

The motivating contrast: CoT samples one linear reasoning path. If an early step goes wrong, the chain runs to a wrong answer without revisiting the bad step. Tree of Thoughts treats the problem as search, with the model acting as both the move generator and the evaluator.

The mechanism

Four components:

1. Thought decomposition

What is a "thought" for this task? A line of math, a sentence in a writing task, a tile move in a puzzle. The decomposition determines the granularity of the search.

2. Thought generation

At each search-tree node, the model produces multiple candidate next thoughts. Two strategies are common: sample diverse continuations from the same prompt, or propose distinct alternatives in a single call.

3. State evaluation

The model judges the value of each candidate thought. The judgment takes the shape of a scalar score (rate this state on a scale), a comparative ranking, or a classification (sure / maybe / impossible). The evaluator is itself an LLM call.

4. Search algorithm

A standard search algorithm (BFS, DFS, beam search) walks the tree, expanding promising states and pruning hopeless ones. Yao et al. used BFS and DFS depending on the task structure.

When it helps

Tree of Thoughts is appropriate when:

  • The task has structure that admits intermediate-state evaluation. Math, puzzles, planning, code generation with checkable subgoals all fit.
  • Early errors are expensive (the chain commits and cannot recover gracefully).
  • The cost of search (many LLM calls per query) is acceptable relative to the accuracy gain.

The original paper reported large gains on Game of 24 (a small math puzzle benchmark), Creative Writing (where multiple plans compete), and Mini Crosswords. CoT solved Game of 24 about 4% of the time. Tree of Thoughts solved it about 74%.

When it does not help

  • Tasks without checkable intermediate states. Free-form generation where every continuation is plausible offers the evaluator no signal. The search collapses.
  • Tasks where the model cannot self-evaluate reliably. Tree of Thoughts depends on the evaluator being calibrated. A model that rates every state as "promising" cannot prune.
  • Latency-sensitive interactive systems. The cost of search is N-fold more LLM calls. Wall-clock time grows accordingly.

The faithfulness issue from Turpin et al. (2023) extends to the evaluator: the model's stated reason for preferring one branch does not necessarily match the actual basis of its preference. The search is only as honest as the evaluator.

Besta et al. (2024) extended the tree to a Graph of Thoughts, allowing thoughts to form arbitrary dependency graphs rather than only linear chains or trees. The graph generalization fits tasks where intermediate results re-enter the reasoning later (merging two computed quantities back into a third).

  • CoT produces one linear chain.
  • Self-consistency produces N independent linear chains, then votes.
  • Least-to-most decomposes once, then solves a linear sequence.
  • Tree of Thoughts searches a tree of candidates with lookahead and backtracking.
  • Graph of Thoughts generalizes to arbitrary dependency graphs.

The pattern hierarchy adds expressiveness and cost in roughly that order.

Practical guidance

A few rules of thumb:

  • Reach for ToT when the task has clear intermediate states and CoT under-performs. Do not adopt it as a default.
  • Validate that the model evaluates intermediate states reliably for the specific task before scaling up. A weak evaluator wastes the search budget.
  • Cap the search depth and branching factor. Unbounded search burns through LLM cost without proportional gain.
  • For tasks where intermediate states are runnable code, prefer the executable check over an LLM-judged one. The interpreter does not hallucinate.