GPT-5 Scores 97% on ARC-AGI, Setting New SOTA Across All Major Benchmarks
OpenAI today released GPT-5, its most capable model to date. The model scored 97.2% on the ARC-AGI benchmark, a test designed to measure abstract reasoning and generalization that pattern memorization alone is unlikely to solve.
What is ARC-AGI?
The Abstraction and Reasoning Corpus (ARC-AGI), created by François Chollet, challenges AI systems with visual reasoning puzzles that require genuine problem-solving rather than statistical retrieval. Before GPT-5, the best reported score was 85.4% (GPT-4o).
GPT-5 Performance Highlights
| Benchmark | Score | Previous SOTA |
|---|---|---|
| ARC-AGI | 97.2% | 85.4% (GPT-4o) |
| MMLU | 94.8% | 92.1% |
| HumanEval | 98.1% | 94.6% |
| MATH | 96.3% | 91.2% |
The jump in ARC-AGI performance is particularly striking: an 11.8 percentage point improvement over the previous state of the art (97.2% versus 85.4%), which may indicate qualitative changes in reasoning ability rather than incremental scaling.
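For readers who want to check the arithmetic, the per-benchmark gains can be recomputed directly from the table. The snippet below is a minimal sketch: the scores are copied from the table above, while the names `SCORES` and `point_gain` are illustrative and not part of any OpenAI tooling.

```python
# Scores copied from the table above: (GPT-5 score, previous SOTA), in percent.
SCORES = {
    "ARC-AGI": (97.2, 85.4),
    "MMLU": (94.8, 92.1),
    "HumanEval": (98.1, 94.6),
    "MATH": (96.3, 91.2),
}


def point_gain(new: float, old: float) -> float:
    """Return the improvement in percentage points, rounded to one decimal."""
    return round(new - old, 1)


# Gains per benchmark, e.g. ARC-AGI: 97.2 - 85.4 = 11.8 points.
GAINS = {name: point_gain(new, old) for name, (new, old) in SCORES.items()}

if __name__ == "__main__":
    for name, gain in GAINS.items():
        print(f"{name}: +{gain} percentage points")
```

On these figures, the ARC-AGI gain is by far the largest of the four, which is what the paragraph above highlights.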
Implications for the Field
Researchers at Anthropic, Google DeepMind, and Meta have already begun analyzing GPT-5’s outputs to understand how it approaches novel problems. Early findings suggest the model uses a form of internal chain-of-thought that resembles structured planning rather than next-token prediction alone.
GPT-5 is available today via the OpenAI API and ChatGPT. Pricing starts at $10 per million input tokens.
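As a rough budgeting aid, the stated starting price translates into input cost as follows. This is a sketch under one assumption: it covers input tokens only at the $10 per million rate quoted above, since output pricing is not given here. The function name `input_cost_usd` is hypothetical.

```python
# Starting input price quoted in the article, in US dollars per million tokens.
INPUT_PRICE_PER_MILLION_USD = 10.0


def input_cost_usd(input_tokens: int) -> float:
    """Estimate input-token cost in USD; output-token cost is not included."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION_USD


if __name__ == "__main__":
    # 250,000 input tokens at $10 per million tokens.
    print(f"${input_cost_usd(250_000):.2f}")
```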