# April 2026 Multimodal Leaderboard: Vision, Video & Audio
| # | Model | Composite Score |
|---|---|---|
| 1 | Gemini Ultra 2 | 94.8 |
| 2 | Gemini 2.5 Pro | 92.3 |
| 3 | GPT-5 | 89.7 |
| 4 | Claude Opus 4.6 | 87.2 |
| 5 | GPT-4o | 85.9 |
| 6 | Claude Sonnet 4.6 | 83.4 |
| 7 | Llama 4 Maverick | 79.1 |
| 8 | Qwen VL 72B | 76.8 |
| 9 | LLaVA 2.0 34B | 71.2 |
| 10 | Phi-4 Vision | 68.5 |
Multimodal AI is the fastest-growing category. This leaderboard covers image understanding (MMMU), video reasoning (Video-QA), and audio processing (AudioBench), combined into a single weighted composite score; the weights are broken down below.
## Benchmark Breakdown
| Benchmark | Weight | Tests |
|---|---|---|
| MMMU | 40% | College-level multimodal questions requiring text + image reasoning |
| Video-QA | 40% | Answering questions about video content (plot, events, objects) |
| AudioBench | 20% | Audio transcription, emotion detection, speaker identification |
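To make the weighting concrete, here is a minimal sketch of the composite calculation, assuming it is a simple weighted average of the three benchmark scores. The 40/40/20 weights come from the table above; the per-benchmark scores in the example are hypothetical placeholders, not published results.

```python
# Composite score as a weighted average, per the 40/40/20 split above.
WEIGHTS = {"MMMU": 0.40, "Video-QA": 0.40, "AudioBench": 0.20}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores (each on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Hypothetical example: a model scoring 90 on MMMU, 85 on Video-QA,
# and 88 on AudioBench.
example = {"MMMU": 90.0, "Video-QA": 85.0, "AudioBench": 88.0}
print(round(composite_score(example), 1))  # 0.4*90 + 0.4*85 + 0.2*88 = 87.6
```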
## Google’s Dominance
The multimodal leaderboard looks very different from the text leaderboard. Gemini Ultra 2 and Gemini 2.5 Pro hold the top two spots, with a 2.6-point gap over GPT-5 at #3.
Google’s advantage comes from native multimodality: Gemini was designed from the ground up to process multiple modalities simultaneously, rather than having vision and audio bolted on post-training.
## Image-Only Rankings (MMMU standalone)
| Model | MMMU Score |
|---|---|
| Gemini Ultra 2 | 96.2% |
| GPT-5 | 92.8% |
| Gemini 2.5 Pro | 91.4% |
| Claude Opus 4.6 | 88.7% |
| Claude Sonnet 4.6 | 85.3% |
## Video Understanding: The New Frontier
Video reasoning remains the hardest multimodal task. The best model (Gemini Ultra 2) scores 93.1% on Video-QA; the weakest in our top 10 scores 61.4%. Expect rapid progress here in 2026 as video-native training becomes standard.