[Reasoning] · March 17, 2026 · ★ Essential

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans et al.

4 min read · Reasoning

Core Insight

Chain-of-thought prompting, which supplies a few worked examples with intermediate reasoning steps, elicits reasoning in LLMs; a prompted PaLM 540B outperforms finetuned GPT-3 with a verifier on complex math tasks.

By the Numbers

540B

parameters in PaLM model

8

examples required for effective prompting

GSM8K

benchmark where PaLM excelled

outperformed finetuned GPT-3

with a verifier, on complex math tasks

In Plain English

The paper demonstrates chain-of-thought prompting, which improves large language models' reasoning by eliciting intermediate steps. A 540B-parameter PaLM model excelled on the GSM8K math benchmark, even besting a finetuned GPT-3 paired with a verifier.

Knowledge Prerequisites

git blame for knowledge

To fully understand Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, trace this dependency chain first. Papers in our library are linked — click to read them.

DIRECT PREREQ · IN LIBRARY
Attention Is All You Need

Understanding how transformers function is crucial for appreciating reasoning capabilities in large language models.

attention mechanism · transformer architecture · self-attention
DIRECT PREREQ · IN LIBRARY
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

This paper introduces early advancements in transformer-based language models, foundational for understanding how large models evolve reasoning skills.

bidirectional transformers · masked language modeling · transfer learning
DIRECT PREREQ · IN LIBRARY
Language Models are Few-Shot Learners

Understanding few-shot learning helps grasp how large language models can adapt their reasoning processes without extensive training.

few-shot learning · in-context learning · prompting
DIRECT PREREQ · IN LIBRARY
Training Compute-Optimal Large Language Models

This paper discusses the methods to efficiently train large models, which directly affects their ability to perform complex reasoning tasks.

compute efficiency · model scaling · training optimization
DIRECT PREREQ · IN LIBRARY
Self-Consistency Improves Chain of Thought Reasoning in Language Models

The concept of self-consistency is a refinement of existing chain-of-thought methods, critical for an advanced understanding of reasoning improvements.

self-consistency · chain of thought · reasoning enhancement

YOU ARE HERE

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

The Idea Graph

12 nodes · 13 edges
419 words · 3 min read · 7 sections · 12 concepts

Table of Contents

01

The Problem: Lack of Complex Reasoning in Models

78 words

Large language models, despite their impressive capabilities, have struggled with tasks that require complex reasoning. This limitation has been particularly evident in math word problems, where models often fail to produce accurate solutions. The lack of complex reasoning in these models highlights a significant gap in AI's ability to solve intricate tasks effectively. Existing approaches have not sufficiently addressed this issue, calling for new methods to unlock the latent reasoning potential of these large models.

02

Key Insight: The Power of Prompting

62 words

The critical insight presented in this paper is chain-of-thought prompting, which involves guiding models through structured reasoning steps. This approach leverages the latent reasoning ability of large models: their capacity to reason becomes far more apparent when they are prompted effectively. By rethinking how we interact with these models, researchers identified a way to surface natural reasoning abilities that were previously untapped.

03

Method: Chain-of-Thought Prompting

60 words

The chain-of-thought prompting method uses prompts that guide models step-by-step through a reasoning process. By presenting worked examples with intermediate reasoning steps, the model is able to break complex tasks into manageable components, improving its ability to reason through problems. A few-shot prompt consisting of just eight examples can dramatically transform a model's reasoning capabilities, revealing potential that was previously obscured.
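To make this concrete, here is a minimal Python sketch of how such a few-shot prompt might be assembled. The exemplar is modeled on the paper's well-known tennis-ball example; the data structure and the build_cot_prompt helper are illustrative assumptions, not the authors' actual code.

```python
# Minimal sketch of few-shot chain-of-thought prompt assembly.
# A full prompt in the paper's setup would contain eight such
# (question, rationale, answer) exemplars.

COT_EXEMPLARS = [
    {
        "question": (
            "Roger has 5 tennis balls. He buys 2 more cans of tennis "
            "balls. Each can has 3 tennis balls. How many tennis balls "
            "does he have now?"
        ),
        "rationale": (
            "Roger started with 5 balls. 2 cans of 3 tennis balls each "
            "is 6 tennis balls. 5 + 6 = 11."
        ),
        "answer": "11",
    },
    # ... seven more exemplars in the same format ...
]

def build_cot_prompt(new_question: str) -> str:
    """Concatenate worked exemplars, then append the unsolved question."""
    parts = []
    for ex in COT_EXEMPLARS:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['rationale']} The answer is {ex['answer']}.\n"
        )
    parts.append(f"Q: {new_question}\nA:")  # the model continues the pattern
    return "\n".join(parts)
```

Nothing is finetuned here: the reasoning pattern is conveyed entirely in-context, which is why a prompt of just eight examples suffices.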

04

Method: Application on the PaLM 540B Model

51 words

PaLM 540B is a large language model that benefited significantly from chain-of-thought prompting. With these prompts, the model demonstrated remarkable performance improvements, showing that it can handle complex reasoning tasks. This section explores how the combination of effective prompting and the model's inherent capabilities led to these breakthroughs.

05

Results: Outperforming on GSM8K Benchmarks

56 words

The experiments demonstrated that the PaLM 540B model outperformed existing state-of-the-art methods on GSM8K, a collection of challenging math word problems. Notably, the prompted model even surpassed a finetuned GPT-3 equipped with a verifier, highlighting the effectiveness of the proposed prompting technique. These results underscore the potential of structured prompting to elevate model performance on tasks requiring advanced reasoning.

06

Impact: A Paradigm Shift in Model Training

54 words

The success of this approach signifies a paradigm shift in how we train and prompt language models for reasoning tasks. By prioritizing both scale and strategic prompting, researchers can unlock new potential in AI systems. This shift encourages a reevaluation of existing methodologies, promoting techniques that better harness the capabilities of large models.

07

Impact: Enabling Advanced AI Applications

58 words

The development of chain-of-thought prompting opens new possibilities for AI applications that require advanced reasoning. Complex question-answering systems, tutoring platforms, and decision-support tools stand to benefit significantly. This approach can also make AI a more effective collaborator in tasks that demand nuanced understanding, reshaping our interactions with technology.

Experience It

Live Experiment

Chain-of-Thought Prompting

See Chain-of-Thought in Action

Wei et al. showed that prompting a model to reason through intermediate steps dramatically improves performance on math, logic, and common-sense problems. Enter any puzzle and see the difference yourself.

Notice that the direct answer often reflects the intuitive (wrong) response, while step-by-step reasoning forces the model, and you, to catch the error. Wei et al. showed this effect holds at scale across dozens of reasoning benchmarks.
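Here is a minimal sketch of the contrast the demo draws, using the classic bat-and-ball puzzle; the puzzle choice and prompt wording are illustrative assumptions, not the demo's actual implementation.

```python
# Contrast a direct-answer prompt with a step-by-step prompt.
# No model call is made; the point is the shape of the two prompts.

PUZZLE = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

# Direct prompt: invites the model to answer immediately,
# which tends to surface the intuitive (wrong) $0.10.
direct_prompt = f"Q: {PUZZLE}\nA: The answer is"

# Step-by-step prompt: asks the model to reason before answering.
cot_prompt = f"Q: {PUZZLE}\nA: Let's think step by step."

print(direct_prompt, cot_prompt, sep="\n\n")

# Worked through: if the ball costs b, then b + (b + 1.00) = 1.10,
# so 2b = 0.10 and the ball costs $0.05, not $0.10.
```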



How grounded is this content?

Metrics are computed from available source text only — abstract, summary, and impact fields ingested into this system. Full paper PDF is not ingested; numerical claims that originate from within the paper body will not appear in these scores.

Source Richness: 100%

8 of 8 content fields populated. More fields = better-grounded generation.

Source Depth: ~223 words

Total source text analyzed by the model. Includes extended deep-dive summary — high confidence.

Number Grounding: 4 / 4

Key statistics whose numeric values appear verbatim in ingested source text. Unverified stats may originate from the full paper body.

Quote Traceability: 3 / 3

Key passages whose significant vocabulary (≥4-char words) overlap ≥35% with source text. Measures lexical traceability, not semantic accuracy.

Methodology: Number grounding uses regex digit extraction against source text. Quote traceability uses token set intersection on content words stripped of stop-words. Neither metric validates semantic correctness or factual accuracy against the original paper. For full verification, cross-reference with the original paper via the arXiv link above.