Codesota · Models · DeepSeek-V3 · DeepSeek · 15 results · 9 benchmarks

Model card

DeepSeek-V3.

DeepSeek · open-source · LLM

DeepSeek's V3 model.

§ 01 · Benchmarks

Every benchmark DeepSeek-V3 has a recorded score for.

#	Benchmark	Area · Task	Metric	Value	Rank	Date	Source
01	MBPP	Computer Code · Code Generation	pass@1	89.3%	#9/19	—	source ↗
02	MATH	Reasoning · Mathematical Reasoning	accuracy	90.2%	#17/34	—	source ↗
03	GSM8K	Reasoning · Mathematical Reasoning	accuracy	95.8%	#18/32	—	source ↗
04	LiveCodeBench	Computer Code · Code Generation	pass@1	49.2%	#19/30	2024-03-12	source ↗
05	MMLU	Reasoning · Commonsense Reasoning	accuracy	88.5%	#22/41	—	source ↗
06	HumanEval	Computer Code · Code Generation	pass@1	82.6%	#32/42	—	source ↗
07	SWE-Bench Verified	Computer Code · Code Generation	resolve-rate	42.0%	#36/39	—	source ↗
08	PLCC	Natural Language Processing · Polish Cultural Competency	vocabulary	63.0%	#59/165	—	source ↗
09	PLCC	Natural Language Processing · Polish Cultural Competency	culture-and-tradition	73.0%	#60/165	—	source ↗
10	PLCC	Natural Language Processing · Polish Cultural Competency	art-and-entertainment	61.0%	#61/165	—	source ↗
11	PLCC	Natural Language Processing · Polish Cultural Competency	geography	79.0%	#61/165	—	source ↗
12	PLCC	Natural Language Processing · Polish Cultural Competency	average	69.2%	#67/165	—	source ↗
13	PLCC	Natural Language Processing · Polish Cultural Competency	history	77.0%	#69/165	—	source ↗
14	SWE-bench Verified	Agentic AI · SWE-bench	resolve-rate	42.0%	#70/81	—	source ↗
15	PLCC	Natural Language Processing · Polish Cultural Competency	grammar	62.0%	#77/165	—	source ↗

The Rank column shows this model’s position among all models scored on the same benchmark and metric; the number after the slash counts the competitors. A #1 shown in red marks the current SOTA. Rows are sorted by rank, then by newest result.
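
The ranking and sort order described above can be sketched in a few lines of Python. This is a minimal illustration, not Codesota's actual implementation. It assumes that higher scores are better, that tied scores share the better rank, and that undated rows sort after dated ones at the same rank. The names `Row`, `rank_of`, and `sort_rows` are hypothetical, and the competitor scores passed to `rank_of` are invented for the example.

```python
from __future__ import annotations

from datetime import date
from typing import NamedTuple, Optional


class Row(NamedTuple):
    benchmark: str
    metric: str
    rank: int                      # position among scored models (1 = best)
    result_date: Optional[date]    # None where the table shows no date


def rank_of(score: float, others: list[float]) -> int:
    """1-based rank of `score` against the other models' scores.

    Assumption: higher is better, and ties share the better rank.
    """
    return 1 + sum(1 for s in others if s > score)


def sort_rows(rows: list[Row]) -> list[Row]:
    """Sort by rank ascending, then newest result first.

    Assumption: rows without a date come last among equal ranks.
    """
    def key(r: Row) -> tuple[int, int]:
        # Negating the ordinal puts newer dates first; 0 sorts after any
        # negated ordinal, so undated rows land at the end of their rank.
        newest_first = -r.result_date.toordinal() if r.result_date else 0
        return (r.rank, newest_first)

    return sorted(rows, key=key)


# Ranks and the one date are taken from the table above.
rows = [
    Row("PLCC", "geography", 61, None),
    Row("LiveCodeBench", "pass@1", 19, date(2024, 3, 12)),
    Row("MBPP", "pass@1", 9, None),
]
ordered = [r.benchmark for r in sort_rows(rows)]

# Invented competitor scores, purely to exercise rank_of.
example_rank = rank_of(89.3, [95.0, 91.0, 80.0])
```

Because `sorted` is stable, rows that tie on both rank and date keep their input order, which is one plausible way for two rows sharing a rank with no date to stay in a fixed order.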

§ 02 · Strengths by area

Where DeepSeek-V3 actually performs.

Natural Language Processing

§ 03 · Papers

1 paper with results for DeepSeek-V3.

2024-03-12 · Computer Code · 1 result
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

§ 04 · Related models

Other DeepSeek models scored on Codesota.

DeepSeek R1

671B MoE params · 10 results

DeepSeek-Coder-V2-Instruct

Unknown params · 4 results

685B MoE params · 2 results

§ 05 · Sources & freshness

Where these numbers come from.

sdadas/PLCC

results

arxiv

results

openai-simple-evals

results

deepseek-paper

result

official-leaderboard

result

swebench-leaderboard

result

editorial

result

11 of 15 rows marked verified.
