Codesota · Models · GPT-4 Turbo · OpenAI · 13 results · 6 benchmarks
Model card

GPT-4 Turbo.

OpenAI · api · Undisclosed params
§ 01 · Benchmarks

Every benchmark GPT-4 Turbo has a recorded score for.

| # | Benchmark | Area · Task | Metric | Value | Rank | Date | Source |
|---|---|---|---|---|---|---|---|
| 01 | HumanEval | Computer Code · Code Generation | pass@1 | 88.2% | #24/42 | | source ↗ |
| 02 | HumanEval | Computer Code · Code Generation | pass@1 | 86.6% | #28/42 | 2023-11-01 | source ↗ |
| 03 | GPQA | Reasoning · Multi-step Reasoning | accuracy | 49.3% | #29/33 | | source ↗ |
| 04 | MATH | Reasoning · Mathematical Reasoning | accuracy | 73.4% | #29/34 | | source ↗ |
| 05 | MMLU | Reasoning · Commonsense Reasoning | accuracy | 86.7% | #30/41 | | source ↗ |
| 06 | SWE-Bench | Computer Code · Code Generation | resolve-rate | 12.5% | #31/32 | 2024-03-01 | source ↗ |
| 07 | PLCC | Natural Language Processing · Polish Cultural Competency | culture-and-tradition | 74.0% | #57/165 | | source ↗ |
| 08 | PLCC | Natural Language Processing · Polish Cultural Competency | art-and-entertainment | 61.0% | #61/165 | | source ↗ |
| 09 | PLCC | Natural Language Processing · Polish Cultural Competency | geography | 79.0% | #61/165 | | source ↗ |
| 10 | PLCC | Natural Language Processing · Polish Cultural Competency | average | 67.0% | #72/165 | | source ↗ |
| 11 | PLCC | Natural Language Processing · Polish Cultural Competency | history | 76.0% | #73/165 | | source ↗ |
| 12 | PLCC | Natural Language Processing · Polish Cultural Competency | vocabulary | 56.0% | #81/165 | | source ↗ |
| 13 | PLCC | Natural Language Processing · Polish Cultural Competency | grammar | 56.0% | #93/165 | | source ↗ |

The Rank column shows this model’s position among all models scored on the same benchmark and metric (competitors after the slash). #1 in red marks the current SOTA. Rows are sorted by rank, then by newest result.
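
The rank notation and sort order described above can be sketched in a few lines. This is an illustrative reading, not Codesota's actual implementation: the helper names `parse_rank` and `sort_key` are hypothetical, and placing undated rows after dated ones within a tied rank is an assumption, since the page does not say how missing dates are handled.

```python
import re
from datetime import date

RANK_RE = re.compile(r"#(\d+)/(\d+)")

def parse_rank(cell):
    """Split a rank cell such as '#24/42' into (position, competitors)."""
    match = RANK_RE.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized rank cell: {cell!r}")
    return int(match.group(1)), int(match.group(2))

def sort_key(row):
    """Order by rank ascending, then newest date first.

    Assumption: rows without a date sort after dated rows of the same rank.
    """
    position, _ = parse_rank(row["rank"])
    dated = row["date"] is not None
    # Negating the ordinal makes newer dates sort first within a rank.
    recency = -row["date"].toordinal() if dated else 0
    return (position, not dated, recency)

# A few rows copied from the table, deliberately out of order.
rows = [
    {"benchmark": "SWE-Bench", "rank": "#31/32", "date": date(2024, 3, 1)},
    {"benchmark": "HumanEval", "rank": "#28/42", "date": date(2023, 11, 1)},
    {"benchmark": "HumanEval", "rank": "#24/42", "date": None},
    {"benchmark": "GPQA", "rank": "#29/33", "date": None},
]
ordered = sorted(rows, key=sort_key)
for row in ordered:
    print(row["benchmark"], row["rank"], row["date"])
```

Returning a tuple from `sort_key` keeps the two-level ordering in a single `sorted` call, so rank always dominates and the date only breaks ties.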
§ 02 · Strengths by area

Where GPT-4 Turbo actually performs.

Computer Code · 2 benchmarks · avg rank #27.7
Reasoning · 3 benchmarks · avg rank #29.3
Natural Language Processing · 1 benchmark · avg rank #71.1
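
The per-area averages are consistent with a plain mean over the rank positions of every row in that area (rows, not distinct benchmarks). The sketch below reproduces them under that assumption; the function name `average_rank_by_area` is hypothetical, and the rank values are copied from the benchmark table in § 01.

```python
from collections import defaultdict

# (area, rank position) pairs copied from the benchmark table in § 01.
RESULTS = [
    ("Computer Code", 24), ("Computer Code", 28), ("Computer Code", 31),
    ("Reasoning", 29), ("Reasoning", 29), ("Reasoning", 30),
    ("Natural Language Processing", 57),
    ("Natural Language Processing", 61),
    ("Natural Language Processing", 61),
    ("Natural Language Processing", 72),
    ("Natural Language Processing", 73),
    ("Natural Language Processing", 81),
    ("Natural Language Processing", 93),
]

def average_rank_by_area(results):
    """Return the mean rank position per area, rounded to one decimal."""
    grouped = defaultdict(list)
    for area, position in results:
        grouped[area].append(position)
    return {
        area: round(sum(positions) / len(positions), 1)
        for area, positions in grouped.items()
    }

averages = average_rank_by_area(RESULTS)
for area, avg in averages.items():
    print(f"{area}: avg rank #{avg}")
```

Because the competitor count differs per benchmark (32 to 165 models), these averages are not directly comparable across areas; a rank of #71 out of 165 is a stronger relative position than #29 out of 33.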
§ 03 · Papers

1 paper with results for GPT-4 Turbo.

  1. 2023-10-10 · Computer Code · 1 result

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao et al.
§ 04 · Related models

Other OpenAI models scored on Codesota.

GPT-4o · Undisclosed params · 35 results · 9 SOTA
o3 · 16 results · 5 SOTA
o4-mini · 13 results · 3 SOTA
o3 (high) · 2 results · 1 SOTA
o4-mini (high) · 1 result · 1 SOTA
o1 · 11 results
GPT-5 · 8 results
o1-preview · Undisclosed params · 8 results
§ 05 · Sources & freshness

Where these numbers come from.

sdadas/PLCC · 7 results
openai-simple-evals · 4 results
shadow-page-humaneval · 1 result
sota-timeline · 1 result

9 of 13 rows marked verified. First result 2023-11-01, latest 2024-03-01.