Benchmark comparison across open-weight models: DeepSeek-R1, Llama 3, Qwen 2.5, Mistral, Gemma 3. Run locally or self-host — no API fees.
Rows are sorted by each model's highest available score. MMLU = MMLU accuracy; MATH = MATH-500 accuracy; LCB = LiveCodeBench Pass@1.
| # | Model | Provider | Params | MMLU | MATH | LCB | License |
|---|---|---|---|---|---|---|---|
| 1 | DeepSeek-R1-Zero | DeepSeek | — | — | 95.9% | — | — |
| 2 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | — | — | 94.5% | 65.2% | — |
| 3 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | — | — | 94.3% | 62.1% | — |
| 4 | DeepSeek-v3-0324 | DeepSeek | — | — | 94% | 49.2% | — |
| 5 | DeepSeek R1 | DeepSeek | 671B MoE | 90.8% | 97.3% | 65.9% | — |
| 6 | QwQ-32B | Alibaba/Qwen | — | — | 90.6% | — | — |
| 7 | Llama-4-Maverick | Meta | 400B total / 17B active (128 experts) | 89.4% | 89.4% | 43.4% | — |
| 8 | Qwen 3 72B | Alibaba | 72B | 88.7% | — | — | — |
| 9 | Llama 3.1 405B | Meta | — | 88.6% | 73.8% | — | — |
| 10 | DeepSeek-V3 | DeepSeek | — | 88.5% | 90.2% | 49.2% | — |
| 11 | DeepSeek V3.5 | DeepSeek | 685B MoE | 88.2% | — | — | — |
| 12 | Llama 4 405B | Meta | 405B | 87.8% | — | — | — |
| 13 | Mistral Large 3 | Mistral | 123B | 87.1% | — | — | — |
| 14 | MiniMax M2.5 | MiniMax | Unknown | 86.5% | — | — | — |
| 15 | Qwen2.5-72B-Instruct | Alibaba | 72B | 86.1% | 83.1% | — | — |
| 16 | Qwen 3 14B | Alibaba | 14B | 84.3% | — | — | — |
| 17 | Phi-4 14B | Microsoft | 14B | 83.9% | — | — | — |
| 18 | Llama 3.1 70B | Meta | — | 82% | 68% | — | — |
| 19 | DeepSeek-R1-0528 | DeepSeek | — | — | — | 73.3% | — |
| 20 | Qwen3-235B-A22B | Alibaba | 235B (22B active) | — | — | 70.7% | — |
| 21 | Qwen2.5-Coder 32B | Alibaba | 32B | — | — | 47.8% | — |
| 22 | DeepSeek-Coder-V2-Instruct | DeepSeek | Unknown | — | — | 43.4% | — |
| 23 | Gemma-3-27b | Google DeepMind | 27B | — | — | 39% | — |
| 24 | Llama-4-Scout | Meta | 109B total / 17B active (16 experts) | — | — | 32.8% | — |
| 25 | Gemma 3 12B IT | Google DeepMind | 12B | — | — | 32% | — |
| 26 | Codestral 22B | Mistral | Unknown | — | — | 29.5% | — |
| 27 | Gemma 3 4B IT | Google DeepMind | 4B | — | — | 23% | — |
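Pass@1 in the LCB column is the fraction of problems solved on the first sampled attempt. When an evaluation samples n completions per problem and c of them pass, the standard unbiased estimator generalizes this to pass@k. A minimal sketch (the function name is illustrative, not LiveCodeBench's API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions drawn (without replacement) from n samples is correct,
    given c of the n samples passed."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the intuitive c/n, i.e. the per-sample solve rate reported in the table.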
671B MoE model trained with reinforcement learning for chain-of-thought reasoning. Matches or exceeds GPT-4o on math and coding benchmarks. MIT license. Requires significant GPU resources to run locally.
Meta's most capable 70B model as of Dec 2024. Practical size for self-hosting on 2x A100 GPUs. Strong instruction following. Limited commercial use under Llama 3 license.
Alibaba's 72B model with excellent math and code performance. Apache 2.0 licensed. Strong multilingual capability including Chinese. Competitive with Llama 3.3 70B on most benchmarks.
Microsoft's 14B model punches above its weight on knowledge benchmarks. MIT license. Good choice for edge deployments.
Llama 3.3 70B and Qwen 2.5 72B are the best options for self-hosting: in 4-bit quantization they fit on 2x A100 or 4x 4090 GPUs. DeepSeek-R1 (671B) requires a full multi-GPU server. For edge devices, Phi-4 14B offers the best quality/size tradeoff.
DeepSeek-R1 matches GPT-4o on math and GPQA Diamond, while being fully open-weight. However, proprietary frontier models like Claude 3.7 and o3 still lead on complex reasoning, agentic tasks, and HLE. The gap was ~2 years in 2023; it's now ~6-12 months.
Open-weight means the model weights are downloadable, but training code and data may not be released. True open-source includes training code and data. Most "open" models (Llama, Mistral) are open-weight. Only some (OLMo, Pythia) are fully open-source.