April 2026 LLM Leaderboard: Overall Performance
Benchmark: MMLU + HumanEval + MATH + ARC-AGI composite
Updated: 2026-04-11
| Rank | Model | Composite Score |
|---|---|---|
| 1 | GPT-5 | 96.1 |
| 2 | Claude Opus 4.6 | 94.3 |
| 3 | Gemini Ultra 2 | 92.7 |
| 4 | Claude Sonnet 4.6 | 91.2 |
| 5 | Llama 4 Maverick | 90.8 |
| 6 | GPT-4o | 88.9 |
| 7 | Gemini 2.5 Pro | 87.4 |
| 8 | Llama 4 Scout | 85.6 |
| 9 | Mistral Large 3 | 83.2 |
| 10 | Qwen 3 72B | 81.9 |
This leaderboard combines four major benchmarks into a single composite score, weighted equally. It is updated monthly as new models are released.
Methodology
Scores are averaged across four benchmarks:
| Benchmark | Weight | Measures |
|---|---|---|
| MMLU | 25% | General knowledge (57 subjects) |
| HumanEval | 25% | Code generation |
| MATH | 25% | Mathematical reasoning |
| ARC-AGI | 25% | Abstract reasoning / generalization |
All models are evaluated in zero-shot mode unless the benchmark specifies otherwise. We use the official benchmark implementations where available.
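For readers who want to reproduce a composite score, here is a minimal sketch of the equally weighted average. The weights match the methodology table above; the function name and the example per-benchmark scores are hypothetical placeholders, not our published results.

```python
# Equal weights, as specified in the methodology table above.
WEIGHTS = {"MMLU": 0.25, "HumanEval": 0.25, "MATH": 0.25, "ARC-AGI": 0.25}

def composite(scores: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores (0-100 scale)."""
    assert set(scores) == set(WEIGHTS), "need exactly one score per benchmark"
    return sum(WEIGHTS[b] * s for b, s in scores.items())

# Example with made-up numbers:
example = {"MMLU": 92.0, "HumanEval": 95.0, "MATH": 90.0, "ARC-AGI": 87.0}
print(f"composite: {composite(example):.1f}")  # -> composite: 91.0
```

With equal weights this reduces to a plain mean of the four scores, which is why a large gain on any single benchmark (such as GPT-5's ARC-AGI result) moves the composite by only a quarter of that amount.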
Notable Movements This Month
- GPT-5 debuts at #1 with a record 97.2% on ARC-AGI
- Llama 4 Maverick enters at #5, the highest any open-weight model has ever placed on our composite
- GPT-4o drops two positions as newer models surpass it
- Qwen 3 72B joins the list this month as China’s strongest open-source entry
Open-Source Highlight
Llama 4 Maverick at #5 is the most significant open-source milestone yet. It outscores GPT-4o (#6) and comes within 3.5 points of Claude Opus 4.6 (#2), all with freely available weights.
Pricing per Million Tokens (Input)
| Model | $ / 1M input tokens |
|---|---|
| GPT-5 | $10 |
| Claude Opus 4.6 | $15 |
| Gemini Ultra 2 | $7 |
| Claude Sonnet 4.6 | $3 |
| Llama 4 Maverick | Free (self-hosted) |
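Since input cost scales linearly with token volume, the table translates directly into a monthly budget. A quick sketch, using the prices above; the helper name and the 50M-token example volume are illustrative assumptions, and self-hosting compute for Llama 4 Maverick is not included.

```python
# $/1M input-token prices from the table above.
PRICE_PER_M = {
    "GPT-5": 10.0,
    "Claude Opus 4.6": 15.0,
    "Gemini Ultra 2": 7.0,
    "Claude Sonnet 4.6": 3.0,
    "Llama 4 Maverick": 0.0,  # free weights; self-hosting compute not included
}

def input_cost_usd(model: str, tokens: int) -> float:
    """Input cost in USD for a given number of input tokens."""
    return PRICE_PER_M[model] * tokens / 1_000_000

# Example: 50M input tokens per month.
for model in PRICE_PER_M:
    print(f"{model}: ${input_cost_usd(model, 50_000_000):,.2f}")
# GPT-5: $500.00, Claude Opus 4.6: $750.00, Gemini Ultra 2: $350.00, ...
```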