Updated April 2026

LLM Benchmarks

Frontier model performance across knowledge, coding, agentic tool use, and extreme difficulty evaluations. Real leaderboard data with source links.

5 benchmarks tracked · GPQA Diamond SOTA 87.7% · MMLU-Pro SOTA 85.1% · HLE SOTA 38.3%

Benchmark Overview

Each benchmark probes a distinct capability — from breadth of knowledge to sustained tool-use reasoning.

| Benchmark | Category | SOTA | Models |
|---|---|---|---|
| MMLU-Pro | Knowledge | 85.1% | 11 |
| GPQA Diamond | Knowledge | 87.7% | 10 |
| LiveCodeBench | Coding | 72.6% | 10 |
| Tau2-Bench | Agentic & Tools | 79% (avg) | 8 |
| HLE (no tools) | Frontier Difficulty | 38.3% | 10 |
Knowledge

MMLU-Pro

A harder version of MMLU: 10-choice multiple-choice questions with stronger distractors, spanning 14 subject categories. The expanded answer set reduces reliance on surface pattern-matching relative to the original 4-choice format. Over 12,000 questions.
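
As a rough illustration of the 10-choice format, here is a minimal sketch of formatting and scoring a single MMLU-Pro item. It assumes the Hugging Face copy of the dataset at TIGER-Lab/MMLU-Pro with question, options, and answer fields; the exact identifier and field names should be checked against the dataset card.

```python
# Minimal sketch: format one MMLU-Pro item as a 10-choice MCQ and score a reply.
# Assumes the dataset exposes "question", "options" (list of strings), and
# "answer" (a letter); these names are assumptions, not guaranteed.
import string
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

def format_prompt(item: dict) -> str:
    # Label up to 10 options A..J, matching the 10-choice format described above.
    lines = [item["question"]]
    for letter, option in zip(string.ascii_uppercase, item["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def is_correct(model_answer: str, item: dict) -> bool:
    # Compare the first letter of the model's reply to the gold answer letter.
    return model_answer.strip()[:1].upper() == item["answer"].strip().upper()
```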

| # | Model | Org | Accuracy |
|---|---|---|---|
| 1 | Claude 3.7 Sonnet | Anthropic | 85.1% |
| 2 | Gemini 2.5 Pro | Google | 83.7% |
| 3 | o3-mini (high) | OpenAI | 79.3% |
| 4 | Claude 3.5 Sonnet | Anthropic | 76.1% |
| 5 | GPT-4o | OpenAI | 72.6% |
| 6 | Gemini 1.5 Pro | Google | 69% |
| 7 | Claude 3 Opus | Anthropic | 68.5% |
| 8 | GPT-4 Turbo | OpenAI | 63.7% |
| 9 | Gemini 1.5 Flash | Google | 59.1% |
| 10 | Llama 3 70B Instruct | Meta | 56.2% |
| 11 | DeepSeek V2 Chat | DeepSeek | 54.8% |

Source: TIGER-AI-Lab/MMLU-Pro · 5-shot chain-of-thought evaluation.

GPQA Diamond

198 expert-authored graduate-level questions in biology, chemistry, and physics. PhD-level specialists score ~65% in their own field. Designed to be "Google-proof": answers cannot be found with a quick web search.

| # | Model | Org | Accuracy |
|---|---|---|---|
| 1 | o3 | OpenAI | 87.7% |
| 2 | Claude 3.7 Sonnet | Anthropic | 84.8% |
| 3 | Gemini 2.0 Flash Thinking | Google | 80.5% |
| 4 | o1 pro | OpenAI | 78% |
| 5 | o1 | OpenAI | 77.3% |
| 6 | DeepSeek-R1 | DeepSeek | 71.5% |
| 7 | Claude 3.5 Sonnet (new) | Anthropic | 65% |
| 8 | Claude 3.5 Sonnet | Anthropic | 59.4% |
| 9 | GPT-4o | OpenAI | 53.6% |
| 10 | Gemini 1.5 Pro | Google | 46.2% |

Source: arXiv:2311.12022 · Human expert baseline (non-specialist): 34%. PhD specialist: ~65%.

Coding

LiveCodeBench

Continuously updated with new contest problems from LeetCode, Codeforces, and AtCoder, which avoids data contamination. Evaluates code generation, self-repair, code execution, and test output prediction.
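
LiveCodeBench reports Pass@1, the fraction of problems solved by the first sampled program. The leaderboard's exact harness may differ, but the usual unbiased pass@k estimator (from the HumanEval paper, Chen et al., 2021) is computed like this:

```python
# Standard unbiased pass@k estimator: probability that at least one of k
# samples drawn from n generations (of which c passed all tests) is correct.
# Pass@1 is the k=1 case. This is the common formula, not necessarily the
# exact code used by the LiveCodeBench harness.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 passing -> pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))
```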

| # | Model | Org | Pass@1 |
|---|---|---|---|
| 1 | o3-mini (high) | OpenAI | 72.6% |
| 2 | Claude 3.7 Sonnet | Anthropic | 68.9% |
| 3 | Gemini 2.5 Pro | Google | 67.4% |
| 4 | DeepSeek-R1 | DeepSeek | 65.9% |
| 5 | o1 | OpenAI | 63.4% |
| 6 | Claude 3.5 Sonnet (new) | Anthropic | 60.8% |
| 7 | GPT-4o (Nov) | OpenAI | 54.3% |
| 8 | Gemini 1.5 Pro | Google | 50.8% |
| 9 | Claude 3.5 Sonnet | Anthropic | 49.2% |
| 10 | GPT-4o | OpenAI | 47.1% |

Source: livecodebench.github.io · Problems released after model training cutoffs to prevent contamination.

Agentic & Tools

Tau2-Bench

Simulates real customer service interactions: agents use tools and databases to resolve tasks in retail, airline, and telecom domains across multi-turn dialogues. Pass rate = task fully resolved.
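
To make the evaluation loop concrete, here is a schematic of how a tool-using agent is typically scored on such a task: the model converses with a simulated user, calls domain tools against a simulated database, and the episode counts as a pass only if the final state matches the task's goal. This is not the sierra-research/tau2-bench API; every name below (call_model, task.goal_state, etc.) is an illustrative placeholder.

```python
# Schematic multi-turn tool-use episode in the style of Tau2-Bench (placeholder API).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Episode:
    history: list = field(default_factory=list)   # dialogue turns and tool calls
    db: dict = field(default_factory=dict)        # simulated retail/airline/telecom DB

def run_episode(task, tools: dict[str, Callable], call_model, max_turns: int = 20) -> bool:
    ep = Episode(db=dict(task.initial_db))
    ep.history.append({"role": "user", "content": task.user_message})
    for _ in range(max_turns):
        action = call_model(ep.history, list(tools))      # model picks a reply or a tool call
        if action["type"] == "tool_call":
            result = tools[action["name"]](ep.db, **action["args"])
            ep.history.append({"role": "tool", "name": action["name"], "content": result})
        else:
            ep.history.append({"role": "assistant", "content": action["content"]})
            if task.user_is_done(ep.history):
                break
    # Pass only if the task is fully resolved, i.e. the database reached the goal state.
    return ep.db == task.goal_state

def avg_pass_rate(results_per_seed: list[list[bool]]) -> float:
    # Average pass rate across seeds, as reported in the leaderboard below.
    return sum(sum(r) / len(r) for r in results_per_seed) / len(results_per_seed)
```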

| # | Model | Org | Avg Pass Rate |
|---|---|---|---|
| 1 | Claude Opus 4.5 | Anthropic | 79% |
| 2 | GPT-5.2 | OpenAI | 73% |
| 3 | Gemini 3 Pro | Google | 69% |
| 4 | Claude Sonnet 4.5 | Anthropic | 63% |
| 5 | GPT-5.1 | OpenAI | 59% |
| 6 | Gemini 2.5 Pro | Google | 54% |
| 7 | Claude 3.7 Sonnet | Anthropic | 47% |
| 8 | GPT-4o | OpenAI | 36% |

Source: sierra-research/tau2-bench · Average across 3 seeds per model.

Frontier Difficulty

Humanity's Last Exam (HLE)

3,000 extremely hard questions across math, science, law, and humanities — contributed by domain experts worldwide. Designed to remain unsaturated for years. No tools allowed in this variant.

| # | Model | Org | Accuracy |
|---|---|---|---|
| 1 | Gemini 3 Pro | Google | 38.3% |
| 2 | GPT-5 | OpenAI | 25.3% |
| 3 | Grok 4 | xAI | 24.5% |
| 4 | Gemini 2.5 Pro | Google | 21.6% |
| 5 | GPT-5-mini | OpenAI | 19.4% |
| 6 | Claude 4.5 Sonnet | Anthropic | 13.7% |
| 7 | Gemini 2.5 Flash | Google | 12.1% |
| 8 | DeepSeek-R1 | DeepSeek | 8.5% |
| 9 | o1 | OpenAI | 8% |
| 10 | GPT-4o | OpenAI | 2.7% |
Note: the HLE leaderboard also reports calibration error, a measure of over-confidence. A high calibration error means the model says it is sure when it is wrong; lower is better.
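
As a rough illustration (not necessarily the exact formula used by the HLE leaderboard), a binned expected calibration error can be computed from per-question confidences and correctness like this:

```python
# Sketch of binned expected calibration error (ECE): the average gap between
# stated confidence and actual accuracy, weighted by how many answers fall in
# each confidence bin. The HLE leaderboard's exact metric may differ.
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap   # weight the gap by the bin's share of answers
    return float(ece)
```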

Source: agi.safe.ai · Leaderboard as of April 2025.

Missing a benchmark or result?

We track LLM benchmarks as new evaluations and model results are published.
