April 2026 Coding Leaderboard: HumanEval + SWE-Bench
| # | Model | Score |
|---|---|---|
| 1 | GPT-5 | 95.8 |
| 2 | Claude Opus 4.6 | 93.2 |
| 3 | Claude Sonnet 4.6 | 91.4 |
| 4 | Llama 4 Maverick | 90.1 |
| 5 | Gemini 2.5 Pro | 87.9 |
| 6 | GPT-4o | 86.5 |
| 7 | Codestral 2.5 | 85.3 |
| 8 | Llama 4 Scout | 80.2 |
| 9 | Qwen 3 Coder 72B | 79.8 |
| 10 | DeepSeek Coder V3 | 77.4 |
Coding performance is the metric that matters most for AI-assisted development. This leaderboard combines HumanEval (function generation) with SWE-Bench (fixing real GitHub issues) for the most comprehensive coding evaluation available.
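HumanEval results are conventionally reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A minimal sketch of the standard unbiased estimator, where n samples are drawn per problem and c of them pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n total (of which c are correct),
    passes the tests."""
    if n - c < k:
        return 1.0  # too few failures left for any k-subset to miss
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 4))  # 0.3 -> with 3/10 samples correct, pass@1 is 30%
```

Composite leaderboards like this one typically use pass@1, the strictest setting.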
Benchmark Weights
| Benchmark | Weight | What It Tests |
|---|---|---|
| HumanEval | 40% | Function-level code generation from docstrings |
| SWE-Bench | 60% | Fixing real-world GitHub issues in production repos |
SWE-Bench receives higher weight because it measures practical utility on real codebases, not just synthetic tasks.
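If the composite score is a simple weighted mean of the two benchmarks (an assumption; the post doesn't spell out the formula), it can be sketched as follows, with hypothetical per-benchmark inputs:

```python
# Weights from the table above.
WEIGHTS = {"humaneval": 0.40, "swebench": 0.60}

def composite(scores: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores (both on a 0-100 scale)."""
    return sum(WEIGHTS[bench] * score for bench, score in scores.items())

# Hypothetical model scoring 90 on HumanEval and 70 on SWE-Bench:
print(composite({"humaneval": 90.0, "swebench": 70.0}))  # 78.0
```

With a 60% weight, a one-point gain on SWE-Bench moves the composite 1.5x as much as the same gain on HumanEval.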
Key Observations
GPT-5 leads overall, but the gap over Claude Opus 4.6 is narrow (2.6 points). In practice, both models perform comparably on typical engineering tasks; Claude Sonnet 4.6 at #3 is often the better value play.
Llama 4 Maverick at #4 confirms the open-source story: for teams that can self-host, Maverick delivers near-commercial coding capability at zero API cost.
Codestral 2.5 at #7 is notable: Mistral's code-specialized model punches above its weight class, keeping pace with much larger general-purpose models.
SWE-Bench Scores (Standalone)
SWE-Bench requires models to navigate a full codebase, understand context, and produce diffs that pass CI. It’s the hardest benchmark in this list:
| Model | SWE-Bench (%) |
|---|---|
| GPT-5 | 71.3 |
| Claude Opus 4.6 | 68.9 |
| Claude Sonnet 4.6 | 64.2 |
| Llama 4 Maverick | 61.8 |
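SWE-Bench's pass criterion can be sketched in terms of its FAIL_TO_PASS / PASS_TO_PASS test sets: a patch resolves an instance only if it makes the originally failing tests pass without regressing the originally passing ones. The function and variable names below are illustrative, not the official harness API:

```python
def is_resolved(fail_to_pass: list[str], pass_to_pass: list[str],
                results_after: dict[str, bool]) -> bool:
    """True iff every originally-failing test now passes and no
    originally-passing test regressed after applying the model's patch."""
    targets = fail_to_pass + pass_to_pass
    return all(results_after.get(test, False) for test in targets)

def resolution_rate(instances) -> float:
    """Percentage of instances resolved, i.e. the SWE-Bench score."""
    resolved = sum(is_resolved(f2p, p2p, res) for f2p, p2p, res in instances)
    return 100.0 * resolved / len(instances)

# Two hypothetical instances: one fully fixed, one where the patch
# fixes the bug but breaks a previously passing test.
instances = [
    (["test_bug"], ["test_ok"], {"test_bug": True, "test_ok": True}),
    (["test_bug"], ["test_ok"], {"test_bug": True, "test_ok": False}),
]
print(resolution_rate(instances))  # 50.0
```

The regression check is why SWE-Bench scores stay low even for frontier models: a plausible-looking diff that breaks one unrelated test scores zero for that instance.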