
April 2026 Coding Leaderboard โ€” HumanEval + SWE-Bench

April 10, 2026

Benchmark: HumanEval + SWE-Bench composite (updated 2026-04-10)
| # | Model | Score |
|---|-------|-------|
| 1 | GPT-5 | 95.8 |
| 2 | Claude Opus 4.6 | 93.2 |
| 3 | Claude Sonnet 4.6 | 91.4 |
| 4 | Llama 4 Maverick | 90.1 |
| 5 | Gemini 2.5 Pro | 87.9 |
| 6 | GPT-4o | 86.5 |
| 7 | Codestral 2.5 | 85.3 |
| 8 | Llama 4 Scout | 80.2 |
| 9 | Qwen 3 Coder 72B | 79.8 |
| 10 | DeepSeek Coder V3 | 77.4 |

Coding performance is the metric that matters most for AI-assisted development. This leaderboard combines HumanEval (function generation) with SWE-Bench (fixing real GitHub issues) for the most comprehensive coding evaluation available.

Benchmark Weights

| Benchmark | Weight | What It Tests |
|-----------|--------|---------------|
| HumanEval | 40% | Function-level code generation from docstrings |
| SWE-Bench | 60% | Fixing real-world GitHub issues in production repos |
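
For context, a HumanEval-style task hands the model a function signature plus docstring and asks it to generate the body that passes hidden unit tests. The example below is a hypothetical task in that format, not one drawn from the actual benchmark; the function name and tests are illustrative.

```python
# Hypothetical HumanEval-style prompt: the model sees only the signature and
# docstring, and must generate a body that passes the hidden unit tests.
def rolling_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[0..i].

    >>> rolling_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
    result = []
    current = float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result


if __name__ == "__main__":
    # A check in the style of the benchmark's hidden tests.
    assert rolling_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
```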

SWE-Bench receives higher weight because it measures practical utility on real codebases, not just synthetic tasks.
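
As a minimal sketch of how a weighted composite like this can be computed, assuming both benchmark scores are first normalized to a common 0-100 scale: the weights below match the table, but the per-model inputs are illustrative placeholders, and the leaderboard numbers above presumably involve additional normalization, so this will not reproduce them exactly.

```python
# Weighted composite of normalized benchmark scores.
# Weights match the table above; inputs are illustrative placeholders.
WEIGHTS = {"humaneval": 0.40, "swe_bench": 0.60}

def composite(scores: dict[str, float]) -> float:
    """Weighted average of normalized (0-100) benchmark scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Hypothetical normalized scores for a single model.
example = {"humaneval": 92.0, "swe_bench": 71.3}
print(round(composite(example), 1))  # 0.4*92.0 + 0.6*71.3 = 79.6
```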

Key Observations

GPT-5 leads overall, but its margin over Claude Opus 4.6 is narrow (2.6 points). In practice, both models perform comparably on typical engineering tasks, and Claude Sonnet 4.6 at #3 is often the better value play.

Llama 4 Maverick at #4 confirms the open-source story: for teams that can self-host, Maverick delivers near-commercial coding capability with no per-token API fees (though you still pay for the compute to serve it).

Codestral 2.5 at #7 is notable: Mistral’s code-specialized model punches above its weight, beating general-purpose models of similar size.

SWE-Bench Scores (Standalone)

SWE-Bench requires models to navigate a full codebase, understand context, and produce diffs that pass CI. It’s the hardest benchmark in this list (a rough harness sketch follows the table below):

| Model | SWE-Bench (%) |
|-------|---------------|
| GPT-5 | 71.3 |
| Claude Opus 4.6 | 68.9 |
| Claude Sonnet 4.6 | 64.2 |
| Llama 4 Maverick | 61.8 |
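
To make the "produce diffs that pass CI" step concrete, here is a rough sketch of how a SWE-Bench-style evaluation applies a model-generated patch and runs the repository's tests. The repo path, patch text, and test command are hypothetical, and the real benchmark ships its own containerized harness with per-instance test specifications.

```python
# Rough sketch of one SWE-Bench-style evaluation step: apply a model-generated
# patch to a repo checkout, then run that repo's tests. Paths, the patch text,
# and the test command are hypothetical placeholders.
import subprocess

def evaluate_patch(repo_dir: str, patch: str, test_cmd: list[str]) -> bool:
    """Return True if the patch applies cleanly and the tests pass."""
    # Apply the unified diff produced by the model (read from stdin).
    apply = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir, input=patch,
        text=True, capture_output=True,
    )
    if apply.returncode != 0:
        return False  # a malformed or non-applying diff counts as a failure

    # Run the project's test suite against the patched checkout.
    tests = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return tests.returncode == 0

# Hypothetical usage:
# resolved = evaluate_patch("/tmp/repo", model_diff, ["pytest", "-x", "tests/"])
```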