LLM Benchmarks
Frontier model performance across knowledge, coding, agentic tool use, and extreme difficulty evaluations. Real leaderboard data with source links.
Benchmark Overview
Each benchmark probes a distinct capability — from breadth of knowledge to sustained tool-use reasoning.
| Benchmark | Category | SOTA | Models |
|---|---|---|---|
| MMLU-Pro | Knowledge | 85.1% | 11 |
| GPQA Diamond | Knowledge | 87.7% | 10 |
| LiveCodeBench | Coding | 72.6% | 10 |
| Tau2-Bench | Agentic & Tools | 79% (avg) | 8 |
| HLE (no tools) | Frontier Difficulty | 38.3% | 10 |
MMLU-Pro
A harder version of MMLU: 10-choice multiple-choice questions with model-generated distractors, spanning roughly 12,000 questions across 14 disciplines. The expanded option set reduces the chance of scoring well through surface pattern-matching, a known weakness of the original 4-choice format.
| # | Model | Provider | Accuracy |
|---|---|---|---|
| ★ | Claude 3.7 Sonnet | Anthropic | 85.1% |
| 2 | Gemini 2.5 Pro | Google | 83.7% |
| 3 | o3-mini (high) | OpenAI | 79.3% |
| 4 | Claude 3.5 Sonnet | Anthropic | 76.1% |
| 5 | GPT-4o | OpenAI | 72.6% |
| 6 | Gemini 1.5 Pro | Google | 69.0% |
| 7 | Claude 3 Opus | Anthropic | 68.5% |
| 8 | GPT-4 Turbo | OpenAI | 63.7% |
| 9 | Gemini 1.5 Flash | Google | 59.1% |
| 10 | Llama 3 70B Instruct | Meta | 56.2% |
| 11 | DeepSeek V2 Chat | DeepSeek | 54.8% |
Source: TIGER-AI-Lab/MMLU-Pro · 5-shot chain-of-thought evaluation.
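The 5-shot chain-of-thought protocol means each question is preceded by five worked examples, and the model's free-form reasoning must end in an extractable answer letter. A minimal sketch of that flow, assuming illustrative record fields (`question`, `options`, `cot`) rather than the exact schema of the TIGER-AI-Lab harness:

```python
import re

LETTERS = "ABCDEFGHIJ"  # MMLU-Pro questions have up to 10 options

def build_prompt(shots, question, options):
    """Assemble a few-shot chain-of-thought prompt for a 10-choice question.
    Each shot carries its own worked reasoning ending in a final answer."""
    parts = []
    for s in shots:
        opts = "\n".join(f"({LETTERS[i]}) {o}" for i, o in enumerate(s["options"]))
        parts.append(f"Question: {s['question']}\n{opts}\nAnswer: {s['cot']}")
    opts = "\n".join(f"({LETTERS[i]}) {o}" for i, o in enumerate(options))
    parts.append(f"Question: {question}\n{opts}\nAnswer: Let's think step by step.")
    return "\n\n".join(parts)

def extract_choice(completion):
    """Pull the final letter from a 'the answer is (X)' style conclusion;
    the last match wins, since the reasoning may mention earlier letters."""
    matches = re.findall(r"answer is \(?([A-J])\)?", completion)
    return matches[-1] if matches else None
```

Accuracy is then simply the fraction of questions where `extract_choice` matches the gold letter; completions with no extractable answer count as wrong.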
GPQA Diamond
198 expert-authored graduate-level questions in biology, chemistry, and physics. PhD-level specialists score ~65% in their own field. Designed to be "Google-proof": the answers cannot simply be searched for.
| # | Model | Provider | Accuracy |
|---|---|---|---|
| ★ | o3 | OpenAI | 87.7% |
| 2 | Claude 3.7 Sonnet | Anthropic | 84.8% |
| 3 | Gemini 2.0 Flash Thinking | Google | 80.5% |
| 4 | o1 pro | OpenAI | 78.0% |
| 5 | o1 | OpenAI | 77.3% |
| 6 | DeepSeek-R1 | DeepSeek | 71.5% |
| 7 | Claude 3.5 Sonnet (new) | Anthropic | 65.0% |
| 8 | Claude 3.5 Sonnet | Anthropic | 59.4% |
| 9 | GPT-4o | OpenAI | 53.6% |
| 10 | Gemini 1.5 Pro | Google | 46.2% |
Source: arXiv:2311.12022 · Human expert baseline (non-specialist): 34%. PhD specialist: ~65%.
LiveCodeBench
Continuously updated with new contest problems from LeetCode, Codeforces, and AtCoder, so problems post-date model training cutoffs and data contamination is minimized. Tests code generation, debugging, and self-repair.
| # | Model | Provider | Pass@1 |
|---|---|---|---|
| ★ | o3-mini (high) | OpenAI | 72.6% |
| 2 | Claude 3.7 Sonnet | Anthropic | 68.9% |
| 3 | Gemini 2.5 Pro | Google | 67.4% |
| 4 | DeepSeek-R1 | DeepSeek | 65.9% |
| 5 | o1 | OpenAI | 63.4% |
| 6 | Claude 3.5 Sonnet (new) | Anthropic | 60.8% |
| 7 | GPT-4o (Nov) | OpenAI | 54.3% |
| 8 | Gemini 1.5 Pro | Google | 50.8% |
| 9 | Claude 3.5 Sonnet | Anthropic | 49.2% |
| 10 | GPT-4o | OpenAI | 47.1% |
Source: livecodebench.github.io · Problems released after model training cutoffs to prevent contamination.
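Pass@1 is the probability that a single sampled solution passes all hidden tests, averaged over problems. When a harness draws n samples per problem and c of them pass, the standard unbiased pass@k estimator (introduced with HumanEval and widely reused by code benchmarks; whether LiveCodeBench's reported numbers use multiple samples per problem is an assumption here) is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of k
    samples, drawn without replacement from n total (c correct), passes.
    Computed as 1 - C(n-c, k) / C(n, k), i.e. one minus the probability
    that all k drawn samples are failures."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to c / n, the plain fraction of passing samples; the combinatorial form matters only for k > 1, where naively averaging best-of-k batches would bias the estimate.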
Tau2-Bench
Simulates real customer-service interactions: agents use tools and databases to resolve tasks in retail, airline, and telecom domains across multi-turn dialogues. Pass rate = fraction of tasks fully resolved.
| # | Model | Provider | Avg Pass Rate |
|---|---|---|---|
| ★ | Claude Opus 4.5 | Anthropic | 79% |
| 2 | GPT-5.2 | OpenAI | 73% |
| 3 | Gemini 3 Pro | Google | 69% |
| 4 | Claude Sonnet 4.5 | Anthropic | 63% |
| 5 | GPT-5.1 | OpenAI | 59% |
| 6 | Gemini 2.5 Pro | Google | 54% |
| 7 | Claude 3.7 Sonnet | Anthropic | 47% |
| 8 | GPT-4o | OpenAI | 36% |
Source: sierra-research/tau2-bench · Average across 3 seeds per model.
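Because each model is run with 3 seeds, the reported number is an aggregate: mean over seeds within each domain, then a macro-average across domains. A minimal sketch of that aggregation (the dict-of-lists shape is illustrative, not tau2-bench's actual output format):

```python
def avg_pass_rate(results: dict[str, list[float]]) -> float:
    """results maps domain name -> per-seed pass rates (fractions in [0, 1]).
    Returns the macro-average: seeds averaged within each domain first,
    then domains weighted equally regardless of task count."""
    per_domain = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_domain) / len(per_domain)
```

Macro-averaging keeps a large retail task set from drowning out the harder airline and telecom splits; a micro-average over all tasks would rank some models differently.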
Humanity's Last Exam (HLE)
2,500 extremely hard questions (trimmed from an initial 3,000) across math, science, law, and the humanities, contributed by domain experts worldwide. Designed to remain unsaturated for years. No tools allowed in this variant.
| # | Model | Provider | Accuracy |
|---|---|---|---|
| ★ | Gemini 3 Pro | Google | 38.3% |
| 2 | GPT-5 | OpenAI | 25.3% |
| 3 | Grok 4 | xAI | 24.5% |
| 4 | Gemini 2.5 Pro | Google | 21.6% |
| 5 | GPT-5-mini | OpenAI | 19.4% |
| 6 | Claude 4.5 Sonnet | Anthropic | 13.7% |
| 7 | Gemini 2.5 Flash | Google | 12.1% |
| 8 | DeepSeek-R1 | DeepSeek | 8.5% |
| 9 | o1 | OpenAI | 8.0% |
| 10 | GPT-4o | OpenAI | 2.7% |
Source: agi.safe.ai · Leaderboard as of April 2025.
Missing a benchmark or result?
We update these tables as new evaluations and model results are published.