=== Constitutional AI ===

Anthropic's principal technical contribution to AI alignment is '''Constitutional AI''' (CAI), introduced in a December 2022 paper. CAI is a training methodology in which a language model is guided by a set of written principles — a "constitution" — and trained to self-critique and revise its own outputs against those principles.<ref name="cai-paper">Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic.</ref>

The process involves two stages:
# '''Supervised learning''': The model generates responses, critiques them against the constitutional principles, and revises them. The original model is then fine-tuned on the revised responses.
# '''Reinforcement learning from AI feedback''' (RLAIF): A model compares pairs of responses and judges which one better complies with the constitution. These AI-generated preferences are used to train a preference model, and the fine-tuned model is then trained with reinforcement learning to maximize that preference signal.
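
The two stages above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not Anthropic's implementation: <code>Model</code>, <code>PRINCIPLES</code>, <code>critique_and_revise</code>, <code>ai_preference</code>, and <code>toy_model</code> are hypothetical names, the principles are illustrative rather than quoted from any constitution, and a deterministic stub stands in for a real language model. Fine-tuning and reinforcement learning themselves are omitted.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: prompt text in, completion text out.
Model = Callable[[str], str]

# Illustrative principles only; not quoted from any published constitution.
PRINCIPLES: List[str] = [
    "Prefer the response that is less harmful.",
    "Prefer the response that is more honest.",
]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> Tuple[str, str]:
    """Stage 1: draft a response, then critique and revise it once per principle."""
    response = model(prompt)
    for principle in principles:
        critique = model(f"Critique under principle '{principle}': {response}")
        response = model(f"Revise to address critique '{critique}': {response}")
    # The (prompt, revised response) pair becomes one supervised training example.
    return prompt, response


def ai_preference(judge: Model, prompt: str, a: str, b: str, principle: str) -> int:
    """Stage 2: return 0 if the judge prefers response a, 1 if it prefers b."""
    verdict = judge(f"Judge under principle '{principle}'. Prompt: {prompt} (A) {a} (B) {b}")
    return 0 if verdict.strip().upper().startswith("A") else 1


def toy_model(text: str) -> str:
    """Deterministic stub (an assumption) so the sketch runs without a real model."""
    if text.startswith("Critique"):
        return "The draft could be more careful."
    if text.startswith("Revise"):
        return "A more careful answer."
    if text.startswith("Judge"):
        return "B"
    return "A first-draft answer."


# Stage 1 output: one (prompt, revised response) pair for supervised fine-tuning.
sl_example = critique_and_revise(toy_model, "Example prompt", PRINCIPLES)

# Stage 2 output: one AI-generated preference label over a pair of responses.
label = ai_preference(
    toy_model, "Example prompt", "A first-draft answer.", sl_example[1], PRINCIPLES[0]
)
```

In a full pipeline, many such stage 1 pairs would form the fine-tuning dataset, and many stage 2 labels would form the preference data that drives reinforcement learning.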

RLAIF replaces the human preference labels used in conventional [[reinforcement learning from human feedback]] (RLHF) with AI-generated ones, which reduces reliance on human evaluators and makes alignment training more scalable. Claude 2's constitution drew on sources including the 1948 [[Universal Declaration of Human Rights]] and Apple's Terms of Service, among others.<ref name="anthropic-wiki"/> Claude's current constitution (updated 2026) articulates not just rules but also the reasoning and values behind them, and Anthropic has stated that it intends the document to serve as a model for the broader industry.