February 2025
Claude 3.7 Sonnet and OpenAI o3
Anthropic introduces Claude 3.7 Sonnet with 'extended thinking'; OpenAI's o3 achieves human level on mathematical and scientific benchmarks.
Reasoning as a first-class capability
February 2025 saw two major releases that cemented reasoning as the defining frontier of AI. Anthropic released Claude 3.7 Sonnet with extended thinking — a mode in which the model explicitly shows its step-by-step reasoning process before giving a final answer. OpenAI released o3, the successor to o1, which achieved scores previously thought to require human expert knowledge on mathematical olympiad problems, the ARC-AGI benchmark, and PhD-level science questions.
Extended thinking
Claude 3.7 Sonnet's extended thinking allowed users to set a "thinking budget" — controlling how much reasoning the model performed before answering. On hard coding problems, mathematical reasoning, and multi-step logical tasks, extended thinking produced noticeably better results. It also made the model's reasoning transparent: users could see where it explored dead ends, corrected itself, and built toward a conclusion. Anthropic described this as a step toward more reliable, auditable AI reasoning.
OpenAI o3 and ARC-AGI
o3 achieved 87.5% on ARC-AGI (Abstract and Reasoning Corpus), a benchmark designed by François Chollet to test general fluid reasoning rather than memorization — a benchmark that o1 had scored only 32% on. It also achieved competitive performance on FrontierMath, a benchmark of novel mathematical problems compiled by professional mathematicians. These results reignited the debate about whether AI systems were approaching general intelligence or demonstrating increasingly sophisticated pattern matching at scale.