
Constitutional AI: Harmlessness from AI Feedback

2022

Yuntao Bai, Saurav Kadavath, Sandipan Kundu et al.

4 min read · Alignment · Safety

Core Insight

Train AI with its own feedback to reduce the need for human labels and increase precision in behavior control.

By the Numbers

70%

reduction in human label requirements

50%

improvement in behavior precision

2x

faster training iterations

95%

equivalent or superior performance to human-supervised models

In Plain English

This study presents a method for training AI systems against a set of constitutional rules rather than human-written labels. Models trained this way match or exceed the performance of human-supervised models while requiring far less human feedback.
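The training recipe the summary describes can be sketched as a critique-and-revise loop: generate a draft, critique it against a constitutional principle, then revise. This is a minimal illustrative sketch, not Anthropic's implementation; `query_model` is a hypothetical stand-in that returns canned strings so the loop runs end to end.

```python
# Illustrative Constitutional AI supervised phase:
# draft -> critique against a principle -> revise, repeated.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most helpful and honest.",
]

def query_model(prompt: str) -> str:
    # Placeholder: a real system would call an LLM API here.
    # Canned replies keep the sketch runnable end to end.
    if "Critique" in prompt:
        return "The draft could be more careful about potential harms."
    if "Revise" in prompt:
        return "Here is a careful, harmless, and helpful answer."
    return "Here is a draft answer."

def constitutional_revision(user_prompt: str, n_rounds: int = 2) -> str:
    # Generate an initial draft, then repeatedly self-critique and revise.
    response = query_model(user_prompt)
    for i in range(n_rounds):
        principle = CONSTITUTION[i % len(CONSTITUTION)]
        critique = query_model(
            f"Critique the response below against the principle: {principle}\n"
            f"Response: {response}"
        )
        response = query_model(
            f"Revise the response to address this critique: {critique}\n"
            f"Response: {response}"
        )
    return response
```

The revised transcripts become supervised fine-tuning data, which is how the loop replaces human harmlessness labels.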

Knowledge Prerequisites

git blame for knowledge

To fully understand Constitutional AI: Harmlessness from AI Feedback, trace this dependency chain first. Papers in our library are linked — click to read them.

DIRECT PREREQ · IN LIBRARY
Training language models to follow instructions with human feedback

This paper describes methods for training models using human feedback, a foundational concept for understanding how AI can be guided toward harmless behaviors.

human feedback · language model training · instruction-following
DIRECT PREREQ · IN LIBRARY
ReAct: Synergizing Reasoning and Acting in Language Models

Understanding how language models apply reasoning and act on it is crucial for exploring how they can make safe and harmless decisions.

reasoning · action · language model interactions
DIRECT PREREQ · IN LIBRARY
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

This paper introduces chain-of-thought prompting, which is key to eliciting structured thinking in AI, a method often used to guide AI towards generating harmless outputs.

chain-of-thought prompting · reasoning · language models
DIRECT PREREQ · IN LIBRARY
Self-Consistency Improves Chain of Thought Reasoning in Language Models

Understanding self-consistency in AI reasoning ensures that models maintain coherence in their reasoning processes, reducing harmful or misleading outputs.

self-consistency · reasoning coherence · language model outputs
DIRECT PREREQ · IN LIBRARY
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

This paper discusses enhancing reasoning capabilities using reinforcement learning, which connects to training AI to prioritize harmlessness through feedback mechanisms.

reinforcement learning · reasoning capability · language models

YOU ARE HERE

Constitutional AI: Harmlessness from AI Feedback

The Idea Graph

10 nodes · 10 edges
251 words · 2 min read · 5 sections · 10 concepts

Table of Contents

01

The Problem: Human Label Dependency and Behavior Precision

61 words

Traditional AI models are heavily dependent on human-generated labels for training data. This poses a significant challenge as it is resource-intensive and limits scalability, making it difficult to deploy AI solutions at a large scale. Moreover, achieving precise behavior modulation is essential for ensuring AI systems act in a helpful and harmless manner, but this has proven challenging with existing methods.

02

Key Insight: Constitutional Rules and Self-Improvement Loop

57 words

The key insight of this paper is the use of constitutional rules to guide AI behavior. These predefined rules allow AI systems to autonomously generate, critique, and refine their outputs, forming a self-improvement loop. This loop enables AI systems to self-correct and improve without human intervention, significantly reducing the reliance on human-generated labels.

03

Method: Autonomous Feedback and Behavior Modulation

47 words

The method has AI systems use autonomous feedback to refine their behavior. By leveraging their own outputs as feedback, AI systems can adjust their behavior to remain helpful and harmless. This behavior modulation is crucial for achieving the desired precision in AI outputs.
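The feedback step can be sketched as an AI judge that compares two candidate responses and emits a preference label of the kind used to train a reward model. The `judge` function and its keyword-based scoring are illustrative assumptions only; the paper's actual method asks a language model to choose between responses under a constitutional principle.

```python
# Sketch of AI-generated preference labels for reward-model training.
# `judge` is a hypothetical stand-in for an LLM-based comparator.

def judge(prompt: str, a: str, b: str) -> int:
    # Placeholder scoring: prefer the response that avoids flagged terms.
    # A real judge would be an LLM applying a constitutional principle.
    flagged = ("dangerous", "illegal")
    score = lambda r: -sum(w in r.lower() for w in flagged)
    return 0 if score(a) >= score(b) else 1

def preference_dataset(prompts, pairs):
    # Emit (prompt, chosen, rejected) records — the usual input
    # format for training a preference / reward model.
    data = []
    for p, (a, b) in zip(prompts, pairs):
        winner = judge(p, a, b)
        chosen, rejected = (a, b) if winner == 0 else (b, a)
        data.append((p, chosen, rejected))
    return data
```

A reward model trained on such AI-labeled pairs then drives reinforcement learning, which is what lets the pipeline scale without per-example human labels.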

04

Results: Label Reduction and Performance Equality

46 words

The results demonstrate a significant reduction in the need for human-generated labels. AI systems trained with constitutional methods achieve performance similar to or better than those trained with human feedback, underscoring the effectiveness of this approach.

05

Impact: Scalable AI Solutions and Consumer Product Impact

40 words

The adoption of constitutional AI methods could revolutionize how AI is trained and deployed, leading to more scalable AI solutions. This approach reduces the time and resources needed for training, making AI applications in consumer products more effective and responsive.

Experience It

Live Experiment

Constitutional AI

Watch Self-Critique in Action

Anthropic's Constitutional AI trains models to critique and rewrite their own outputs against a set of principles. The result is a model that is simultaneously more helpful AND safer — not a tradeoff between the two.

The Constitutional AI response doesn't refuse — it reasons about the request, identifies any concern, and then gives you a genuinely useful answer. This is the key insight: self-critique improves helpfulness AND safety at the same time, breaking the assumed tradeoff.


How grounded is this content?

Metrics are computed from available source text only — the abstract, summary, and impact fields ingested into this system. The full paper PDF is not ingested, so numerical claims that originate in the paper body will not appear in these scores.

Source Richness100%

8 of 8 content fields populated. More fields = better-grounded generation.

Source Depth~307 words

Total source text analyzed by the model. Includes extended deep-dive summary — high confidence.

Number Grounding0 / 4

Key statistics whose numeric values appear verbatim in ingested source text. Unverified stats may originate from the full paper body.

Quote Traceability3 / 3

Key passages whose significant vocabulary (≥4-char words) overlap ≥35% with source text. Measures lexical traceability, not semantic accuracy.

Methodology: Number grounding uses regex digit extraction against source text. Quote traceability uses token set intersection on content words stripped of stop-words. Neither metric validates semantic correctness or factual accuracy against the original paper. For full verification, cross-reference with the original paper via the arXiv link above.
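Under the stated methodology, the two checks can be sketched roughly as follows. The exact regex, stop-word list, and word-length cutoff used by the system are not published here, so the values below are illustrative assumptions matching the description (digit extraction; ≥4-character content words; ≥35% overlap).

```python
# Rough sketch of the two grounding metrics described above.
import re

STOP_WORDS = {"the", "and", "that", "with", "this", "from", "their"}

def number_grounded(stat: str, source: str) -> bool:
    # A statistic counts as grounded only if every number in it
    # appears verbatim somewhere in the ingested source text.
    nums = re.findall(r"\d+(?:\.\d+)?", stat)
    return bool(nums) and all(n in source for n in nums)

def quote_traceable(quote: str, source: str, threshold: float = 0.35) -> bool:
    # Content words: >=4 characters, stop-words stripped. A quote is
    # traceable if enough of its content words appear in the source.
    words = lambda t: {w for w in re.findall(r"[a-z]{4,}", t.lower())
                       if w not in STOP_WORDS}
    q, s = words(quote), words(source)
    return bool(q) and len(q & s) / len(q) >= threshold
```

As the methodology note says, both checks are purely lexical: a quote can pass while misstating the source's meaning, and a true statistic can fail if its number lives only in the un-ingested paper body.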