DeepSeek R1
Model card

DeepSeek · open-source · 671B MoE params · 19 results · 13 benchmarks
§ 01 · Benchmarks

Every benchmark DeepSeek R1 has a recorded score for.

#  | Benchmark          | Area · Task                                              | Metric                | Value   | Rank    | Date       | Source
01 | AIME 2025          | Reasoning · Mathematical Reasoning                       | accuracy              | 72.0%   | #5/5    |            | source ↗
02 | ARC-Challenge      | Reasoning · Commonsense Reasoning                        | accuracy              | 97.1%   | #5/10   |            | source ↗
03 | AIME 2024          | Reasoning · Mathematical Reasoning                       | accuracy              | 79.8%   | #6/8    |            | source ↗
04 | MATH               | Reasoning · Mathematical Reasoning                       | accuracy              | 97.3%   | #6/34   |            | source ↗
05 | MMLU               | Reasoning · Commonsense Reasoning                        | accuracy              | 90.8%   | #8/41   | 2025-01-22 | source ↗
06 | LiveCodeBench Pro  | Computer Code · Code Generation                          | elo                   | 1161.00 | #8/9    |            | source ↗
07 | HLE                | Reasoning · Multi-step Reasoning                         | accuracy              | 8.5%    | #10/13  |            | unverified
08 | LiveCodeBench      | Computer Code · Code Generation                          | pass@1                | 65.9%   | #10/30  |            | source ↗
09 | SWE-Bench          | Computer Code · Code Generation                          | resolve-rate          | 76.3%   | #13/32  | 2025-12-01 | source ↗
10 | GPQA               | Reasoning · Multi-step Reasoning                         | accuracy              | 71.5%   | #16/33  |            | source ↗
11 | GSM8K              | Reasoning · Mathematical Reasoning                       | accuracy              | 97.3%   | #16/32  |            | source ↗
12 | SWE-Bench Verified | Computer Code · Code Generation                          | resolve-rate          | 49.2%   | #33/39  |            | source ↗
13 | PLCC               | Natural Language Processing · Polish Cultural Competency | grammar               | 74.0%   | #34/165 |            | source ↗
14 | PLCC               | Natural Language Processing · Polish Cultural Competency | vocabulary            | 72.0%   | #39/165 |            | source ↗
15 | PLCC               | Natural Language Processing · Polish Cultural Competency | history               | 85.0%   | #40/165 |            | source ↗
16 | PLCC               | Natural Language Processing · Polish Cultural Competency | geography             | 84.0%   | #45/165 |            | source ↗
17 | PLCC               | Natural Language Processing · Polish Cultural Competency | average               | 76.0%   | #45/165 |            | source ↗
18 | PLCC               | Natural Language Processing · Polish Cultural Competency | art-and-entertainment | 66.0%   | #47/165 |            | source ↗
19 | PLCC               | Natural Language Processing · Polish Cultural Competency | culture-and-tradition | 75.0%   | #53/165 |            | source ↗
The Rank column shows this model's position among all models scored on the same benchmark + metric; the number after the slash is the total field size. #1 in red means current SOTA. Rows are sorted by rank, then by newest result.
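As a reading aid, a rank cell like `#8/41` splits into a position and a field size. A minimal sketch (the `parse_rank` helper is hypothetical, not part of Codesota):

```python
# Split a Codesota-style rank cell ("#8/41") into the model's position
# and the total number of models scored on that benchmark + metric.
def parse_rank(cell: str) -> tuple[int, int]:
    position, field = cell.lstrip("#").split("/")
    return int(position), int(field)

print(parse_rank("#8/41"))   # -> (8, 41)
print(parse_rank("#33/39"))  # -> (33, 39)
```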
§ 02 · Strengths by area

How DeepSeek R1 performs across each task area.

Reasoning · 8 benchmarks · avg rank #9.0
Computer Code · 4 benchmarks · avg rank #16.0
Natural Language Processing · 1 benchmark (7 metrics) · avg rank #43.3
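The per-area figures above appear to be plain means of the rank positions in § 01. A minimal sketch reproducing them, with ranks copied from the benchmark table (assuming simple-mean aggregation):

```python
# Reproduce the per-area "avg rank" stats from the benchmark table.
# Each rank is the position before the slash (e.g. "#5/5" -> 5).
ranks = {
    "Reasoning": [5, 5, 6, 6, 8, 10, 16, 16],            # AIME 2025 ... GSM8K
    "Computer Code": [8, 10, 13, 33],                     # LiveCodeBench Pro ... SWE-Bench Verified
    "Natural Language Processing": [34, 39, 40, 45, 45, 47, 53],  # PLCC metrics
}

avg_rank = {area: round(sum(r) / len(r), 1) for area, r in ranks.items()}
print(avg_rank)
# -> {'Reasoning': 9.0, 'Computer Code': 16.0, 'Natural Language Processing': 43.3}
```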
§ 03 · Papers

1 paper with results for DeepSeek R1.

  1. 2023-10-10 · Computer Code · 1 result

     SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

     Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao et al.
§ 04 · Related models

Other DeepSeek models scored on Codesota.

DeepSeek-V3 · 7 results
DeepSeek-Coder-V2-Instruct · Unknown params · 4 results
DeepSeek-OCR · 3 results
DeepSeek-R1-0528 · 3 results
DeepSeek V3.5 · 685B MoE params · 2 results
DeepSeek-V2.5 · 2 results
DeepSeek-V3.1 · 2 results
DeepSeek V3.2 · 1 result
§ 05 · Sources & freshness

Where these numbers come from.

sdadas/PLCC · 7 results
arxiv · 6 results
swebench-leaderboard · 2 results
deepseek-paper · 1 result
livecodebench-pro-official · 1 result
editorial · 1 result
arxiv-2501.12948 · 1 result
16 of 19 rows are marked verified · first result 2025-01-22, latest 2025-12-01.