April 2026 Multimodal Leaderboard: Vision, Video & Audio

April 9, 2026
Benchmark: MMMU + Video-QA + AudioBench composite (updated 2026-04-09)
 #  Model              Score
 1  Gemini Ultra 2     94.8
 2  Gemini 2.5 Pro     92.3
 3  GPT-5              89.7
 4  Claude Opus 4.6    87.2
 5  GPT-4o             85.9
 6  Claude Sonnet 4.6  83.4
 7  Llama 4 Maverick   79.1
 8  Qwen VL 72B        76.8
 9  LLaVA 2.0 34B      71.2
10  Phi-4 Vision       68.5

Multimodal AI is the fastest-growing category. This leaderboard covers image understanding (MMMU), video reasoning (Video-QA), and audio processing (AudioBench), combined into a single composite score.

Benchmark Breakdown

Benchmark   Weight  Tests
MMMU        40%     College-level multimodal questions requiring text + image reasoning
Video-QA    40%     Answering questions about video content (plot, events, objects)
AudioBench  20%     Audio transcription, emotion detection, speaker identification
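
Given those weights, the composite is presumably a straight weighted average of the three benchmark scores. The Python sketch below shows that arithmetic. Note that this article only publishes Gemini Ultra 2's MMMU (96.2) and Video-QA (93.1) scores; the AudioBench value used here is a hypothetical placeholder chosen so the average reproduces the published composite of 94.8.

```python
# Minimal sketch of the composite calculation, assuming each benchmark
# reports a 0-100 score and the composite is a plain weighted average
# using the weights from the table above.

WEIGHTS = {"MMMU": 0.40, "Video-QA": 0.40, "AudioBench": 0.20}

def composite(scores: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores (each on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# MMMU and Video-QA come from this article; the AudioBench value is a
# hypothetical placeholder that makes the average land on the published 94.8.
gemini_ultra_2 = {"MMMU": 96.2, "Video-QA": 93.1, "AudioBench": 95.4}
print(f"Composite: {composite(gemini_ultra_2):.1f}")  # Composite: 94.8
```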

Google’s Dominance

The multimodal leaderboard looks very different from the text leaderboard. Gemini Ultra 2 and Gemini 2.5 Pro hold the top two spots, with a 2.6-point gap between 2.5 Pro and GPT-5 at #3.

Google’s advantage comes from native multimodality: Gemini was designed from the ground up to process multiple modalities simultaneously, rather than having vision and audio bolted on post-training.

Image-Only Rankings (MMMU standalone)

Model              MMMU Score
Gemini Ultra 2     96.2%
GPT-5              92.8%
Gemini 2.5 Pro     91.4%
Claude Opus 4.6    88.7%
Claude Sonnet 4.6  85.3%

Video Understanding: The New Frontier

Video reasoning remains the hardest multimodal task. The best model (Gemini Ultra 2) scores 93.1% on Video-QA; the weakest in our top 10 scores 61.4%. Expect rapid progress here in 2026 as video-native training becomes standard.