# April 2026 Multimodal Leaderboard: Vision, Video & Audio
| # | Model | Composite Score |
|---|---|---|
| 1 | Gemini Ultra 2 | 94.8 |
| 2 | Gemini 2.5 Pro | 92.3 |
| 3 | GPT-5 | 89.7 |
| 4 | Claude Opus 4.6 | 87.2 |
| 5 | GPT-4o | 85.9 |
| 6 | Claude Sonnet 4.6 | 83.4 |
| 7 | Llama 4 Maverick | 79.1 |
| 8 | Qwen VL 72B | 76.8 |
| 9 | LLaVA 2.0 34B | 71.2 |
| 10 | Phi-4 Vision | 68.5 |
Multimodal AI is the fastest-growing category. This leaderboard covers image understanding (MMMU), video reasoning (Video-QA), and audio processing (AudioBench), combined into a single weighted composite score; the weights are broken down below.
## Benchmark Breakdown
| Benchmark | Weight | Tests |
|---|---|---|
| MMMU | 40% | College-level multimodal questions requiring text + image reasoning |
| Video-QA | 40% | Answering questions about video content (plot, events, objects) |
| AudioBench | 20% | Audio transcription, emotion detection, speaker identification |
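To make the weighting concrete, here is a minimal sketch of the composite calculation, assuming it is a simple weighted average of the three benchmark scores. The 40/40/20 weights come from the table above; the per-benchmark scores in the example are hypothetical placeholders, not published results.

```python
# Composite score as a weighted average, per the 40/40/20 split above.
WEIGHTS = {"MMMU": 0.40, "Video-QA": 0.40, "AudioBench": 0.20}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores (each on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Hypothetical example: a model scoring 90 on MMMU, 85 on Video-QA,
# and 88 on AudioBench.
example = {"MMMU": 90.0, "Video-QA": 85.0, "AudioBench": 88.0}
print(round(composite_score(example), 1))  # 0.4*90 + 0.4*85 + 0.2*88 = 87.6
```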
## Google’s Dominance
The multimodal leaderboard looks very different from the text leaderboard. Gemini Ultra 2 and Gemini 2.5 Pro hold the top two spots, with a 2.6-point gap over GPT-5 at #3.
Google’s advantage comes from native multimodality: Gemini was designed from the ground up to process multiple modalities simultaneously, rather than having vision and audio bolted on post-training.
## Image-Only Rankings (MMMU standalone)
| Model | MMMU Score |
|---|---|
| Gemini Ultra 2 | 96.2% |
| GPT-5 | 92.8% |
| Gemini 2.5 Pro | 91.4% |
| Claude Opus 4.6 | 88.7% |
| Claude Sonnet 4.6 | 85.3% |
## Video Understanding: The New Frontier
Video reasoning remains the hardest multimodal task. The best model (Gemini Ultra 2) scores 93.1% on Video-QA; the weakest in our top 10 scores 61.4%. Expect rapid progress here in 2026 as video-native training becomes standard.