Benchmark comparison across open-weight models: DeepSeek-R1, Llama 3, Qwen 2.5, Mistral, Gemma 3. Run locally or self-host — no API fees.
Rows are sorted by each model's highest available score. MMLU = MMLU accuracy; MATH = MATH-500 accuracy; LCB = LiveCodeBench Pass@1.
| # | Model | Provider | Params | MMLU | MATH | LCB | License |
|---|---|---|---|---|---|---|---|
| 1 | DeepSeek-R1-Zero | DeepSeek | — | — | 95.9% | — | — |
| 2 | DeepSeek-R1-Distill-Llama-70B | DeepSeek | — | — | 94.5% | 65.2% | — |
| 3 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | — | — | 94.3% | 62.1% | — |
| 4 | DeepSeek-v3-0324 | DeepSeek | — | — | 94% | 49.2% | — |
| 5 | DeepSeek R1 | DeepSeek | 671B MoE | 90.8% | 97.3% | 65.9% | — |
| 6 | QwQ-32B | Alibaba/Qwen | — | — | 90.6% | — | — |
| 7 | Llama-4-Maverick | Meta | 400B total / 17B active (128 experts) | 89.4% | 89.4% | 43.4% | — |
| 8 | Qwen 3 72B | Alibaba | 72B | 88.7% | — | — | — |
| 9 | Llama 3.1 405B | Meta | — | 88.6% | 73.8% | — | — |
| 10 | DeepSeek-V3 | DeepSeek | — | 88.5% | 90.2% | 49.2% | — |
| 11 | DeepSeek V3.5 | DeepSeek | 685B MoE | 88.2% | — | — | — |
| 12 | Llama 4 405B | Meta | 405B | 87.8% | — | — | — |
| 13 | Mistral Large 3 | Mistral | 123B | 87.1% | — | — | — |
| 14 | MiniMax M2.5 | MiniMax | Unknown | 86.5% | — | — | — |
| 15 | Qwen2.5-72B-Instruct | Alibaba | 72B | 86.1% | 83.1% | — | — |
| 16 | Qwen 3 14B | Alibaba | 14B | 84.3% | — | — | — |
| 17 | Phi-4 14B | Microsoft | 14B | 83.9% | — | — | — |
| 18 | Llama 3.1 70B | Meta | — | 82% | 68% | — | — |
| 19 | DeepSeek-R1-0528 | DeepSeek | — | — | — | 73.3% | — |
| 20 | Qwen3-235B-A22B | Alibaba | 235B (22B active) | — | — | 70.7% | — |
| 21 | Qwen2.5-Coder 32B | Alibaba | 32B | — | — | 47.8% | — |
| 22 | DeepSeek-Coder-V2-Instruct | DeepSeek | Unknown | — | — | 43.4% | — |
| 23 | Gemma-3-27b | Google DeepMind | 27B | — | — | 39% | — |
| 24 | Llama-4-Scout | Meta | 109B total / 17B active (16 experts) | — | — | 32.8% | — |
| 25 | Gemma 3 12B IT | Google DeepMind | 12B | — | — | 32% | — |
| 26 | Codestral 22B | Mistral | Unknown | — | — | 29.5% | — |
| 27 | Gemma 3 4B IT | Google DeepMind | 4B | — | — | 23% | — |
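Pass@1 in the LCB column is the fraction of problems solved on the first sampled attempt. When an evaluation samples n completions per problem and c of them pass, the standard unbiased estimator generalizes this to pass@k. A minimal sketch (the function name is illustrative, not LiveCodeBench's API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions drawn (without replacement) from n samples is correct,
    given c of the n samples passed."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some draw must include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the intuitive c/n, i.e. the per-sample solve rate reported in the table.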
671B MoE model trained with reinforcement learning for chain-of-thought reasoning. Matches or exceeds GPT-4o on math and coding benchmarks. MIT license. Requires significant GPU resources to run locally.
Meta's most capable 70B model as of Dec 2024. Practical size for self-hosting on 2x A100 GPUs. Strong instruction following. Limited commercial use under Llama 3 license.
Alibaba's 72B model with excellent math and code performance. Apache 2.0 licensed. Strong multilingual capability including Chinese. Competitive with Llama 3.3 70B on most benchmarks.
Microsoft's 14B model punches above its weight on knowledge benchmarks. MIT license. Good choice for edge deployments.
Llama 3.3 70B and Qwen 2.5 72B are the best options for self-hosting: in 4-bit quantization they fit on 2x A100 or 4x 4090 GPUs. DeepSeek-R1 (671B) requires a full multi-GPU server. For edge devices, Phi-4 14B offers the best quality/size tradeoff.
DeepSeek-R1 matches GPT-4o on math and GPQA Diamond, while being fully open-weight. However, proprietary frontier models like Claude 3.7 and o3 still lead on complex reasoning, agentic tasks, and HLE. The gap was ~2 years in 2023; it's now ~6-12 months.
Open-weight means the model weights are downloadable, but training code and data may not be released. True open-source includes training code and data. Most "open" models (Llama, Mistral) are open-weight. Only some (OLMo, Pythia) are fully open-source.