April 2026 Coding Leaderboard: HumanEval + SWE-Bench
| # | Model | Score |
|---|---|---|
| 1 | GPT-5 | 95.8 |
| 2 | Claude Opus 4.6 | 93.2 |
| 3 | Claude Sonnet 4.6 | 91.4 |
| 4 | Llama 4 Maverick | 90.1 |
| 5 | Gemini 2.5 Pro | 87.9 |
| 6 | GPT-4o | 86.5 |
| 7 | Codestral 2.5 | 85.3 |
| 8 | Llama 4 Scout | 80.2 |
| 9 | Qwen 3 Coder 72B | 79.8 |
| 10 | DeepSeek Coder V3 | 77.4 |
Coding performance is the metric that matters most for AI-assisted development. This leaderboard combines HumanEval (function generation) with SWE-Bench (fixing real GitHub issues) for the most comprehensive coding evaluation available.
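HumanEval results are conventionally reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A minimal sketch of the standard unbiased estimator, where n samples are drawn per problem and c of them pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n total (of which c are correct),
    passes the tests."""
    if n - c < k:
        return 1.0  # too few failures left for any k-subset to miss
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 4))  # 0.3 -> with 3/10 samples correct, pass@1 is 30%
```

Composite leaderboards like this one typically use pass@1, the strictest setting.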
Benchmark Weights
| Benchmark | Weight | What It Tests |
|---|---|---|
| HumanEval | 40% | Function-level code generation from docstrings |
| SWE-Bench | 60% | Fixing real-world GitHub issues in production repos |
SWE-Bench receives higher weight because it measures practical utility on real codebases, not just synthetic tasks.
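If the composite score is a simple weighted mean of the two benchmarks (an assumption; the post doesn't spell out the formula), it can be sketched as follows, with hypothetical per-benchmark inputs:

```python
# Weights from the table above.
WEIGHTS = {"humaneval": 0.40, "swebench": 0.60}

def composite(scores: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores (both on a 0-100 scale)."""
    return sum(WEIGHTS[bench] * score for bench, score in scores.items())

# Hypothetical model scoring 90 on HumanEval and 70 on SWE-Bench:
print(composite({"humaneval": 90.0, "swebench": 70.0}))  # 78.0
```

With a 60% weight, a one-point gain on SWE-Bench moves the composite 1.5x as much as the same gain on HumanEval.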
Key Observations
GPT-5 leads overall, but the gap over Claude Opus 4.6 is narrow (2.6 points). In practice, both models perform comparably on typical engineering tasks; Claude Sonnet 4.6 at #3 is often the better value play.
Llama 4 Maverick at #4 confirms the open-source story: for teams that can self-host, Maverick delivers near-commercial coding capability at zero API cost.
Codestral 2.5 at #7 is notable: Mistral's code-specialized model punches above its weight class, keeping pace with much larger general-purpose models.
SWE-Bench Scores (Standalone)
SWE-Bench requires models to navigate a full codebase, understand context, and produce diffs that pass CI. It’s the hardest benchmark in this list:
| Model | SWE-Bench (%) |
|---|---|
| GPT-5 | 71.3 |
| Claude Opus 4.6 | 68.9 |
| Claude Sonnet 4.6 | 64.2 |
| Llama 4 Maverick | 61.8 |
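SWE-Bench's pass criterion can be sketched in terms of its FAIL_TO_PASS / PASS_TO_PASS test sets: a patch resolves an instance only if it makes the originally failing tests pass without regressing the originally passing ones. The function and variable names below are illustrative, not the official harness API:

```python
def is_resolved(fail_to_pass: list[str], pass_to_pass: list[str],
                results_after: dict[str, bool]) -> bool:
    """True iff every originally-failing test now passes and no
    originally-passing test regressed after applying the model's patch."""
    targets = fail_to_pass + pass_to_pass
    return all(results_after.get(test, False) for test in targets)

def resolution_rate(instances) -> float:
    """Percentage of instances resolved, i.e. the SWE-Bench score."""
    resolved = sum(is_resolved(f2p, p2p, res) for f2p, p2p, res in instances)
    return 100.0 * resolved / len(instances)

# Two hypothetical instances: one fully fixed, one where the patch
# fixes the bug but breaks a previously passing test.
instances = [
    (["test_bug"], ["test_ok"], {"test_bug": True, "test_ok": True}),
    (["test_bug"], ["test_ok"], {"test_bug": True, "test_ok": False}),
]
print(resolution_rate(instances))  # 50.0
```

The regression check is why SWE-Bench scores stay low even for frontier models: a plausible-looking diff that breaks one unrelated test scores zero for that instance.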