A harder version of MMLU: 10-choice multiple-choice questions (up from 4 in MMLU) across 14 subjects, about 12,000 questions in total. It is less sensitive to prompt format and demands more reasoning than the original.
Score is accuracy (%); higher is better.
| Rank | Model | Trust | Score | Year | Source |
|---|---|---|---|---|---|
| 01 | Gemini 3.1 Pro | unverified | 90.99 | 2026 | Source ↗ |
| 02 | Gemini 3 Pro | unverified | 89.8 | 2026 | Source ↗ |
| 03 | Claude Opus 4.5 | unverified | 89.5 | 2026 | Source ↗ |
| 04 | Gemini 3 Flash | unverified | 89 | 2026 | Source ↗ |
| 05 | Qwen3.6 Plus | unverified | 88.5 | 2026 | Source ↗ |
| 06 | Claude Opus 4.1 | unverified | 88 | 2026 | Source ↗ |
| 07 | MiniMax M2.1 | unverified | 88 | 2026 | Source ↗ |
| 08 | Qwen3.5-397B-A17B | unverified | 87.8 | 2026 | Source ↗ |
| 09 | Claude Sonnet 4.5 | unverified | 87.5 | 2026 | Source ↗ |
| 10 | GPT-5.2 | unverified | 87.4 | 2026 | Source ↗ |
| 11 | Kimi K2.5 | unverified | 87.1 | 2026 | Source ↗ |
| 12 | GPT-5 | unverified | 87.1 | 2026 | Source ↗ |
| 13 | GPT-5.1 | unverified | 87 | 2026 | Source ↗ |
| 14 | Grok 4 | unverified | 86.6 | 2026 | Source ↗ |
| 15 | DeepSeek V3.2 | unverified | 86.2 | 2026 | Source ↗ |
| 16 | Claude 3.7 Sonnet | unverified | 85.1 | 2026 | Source ↗ |
| 17 | DeepSeek-R1-0528 | unverified | 85 | 2026 | Source ↗ |
| 18 | Kimi K2-Thinking-0905 | unverified | 84.6 | 2026 | Source ↗ |
| 19 | GLM-4.5 | unverified | 84.6 | 2026 | Source ↗ |
| 20 | GPT-4o | unverified | 72.6 | 2026 | Source ↗ |
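
For context on reading these numbers, here is a minimal sketch of how a score of this kind could be computed, assuming the Score column is plain accuracy in percent over 10-choice questions. The function names and the toy data are hypothetical and are not taken from any official evaluation harness.

```python
OPTIONS = "ABCDEFGHIJ"  # ten answer letters, matching the 10-choice format


def accuracy_percent(predictions, answers):
    """Return the percentage of questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def chance_baseline_percent(num_options):
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 100.0 / num_options


# Hypothetical toy data, not real benchmark items.
gold = ["A", "J", "C", "F"]
pred = ["A", "J", "D", "F"]

print(accuracy_percent(pred, gold))           # 75.0
print(chance_baseline_percent(len(OPTIONS)))  # 10.0 with ten options
print(chance_baseline_percent(4))             # 25.0 with four options, as in MMLU
```

Under this assumption, random guessing lands near 10 rather than 25, so the same raw score reflects more signal above chance here than it would on a four-option benchmark.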