Codesota · Registry log · 9,080 rows · 7,132 new this month · Showing 200
Editorial · Registry log

Every score we've added, in order.

The append-only public ledger of every benchmark result on Codesota. Each row records when it was written, when the result itself is dated, which model it was, what value was claimed, and where the citation lives. New-SOTA rows are marked in colour; unverified rows still appear, but are labelled as such.

This is the audit trail. If a score is wrong, this is where the error will be visible; if a source is missing, this is where you'll see the gap.

2026-04-23 · 181 rows
  1. 20:40 · Gemini 3 Flash · LiveCodeBench · 90.8% · -0.90 · source ↗ · verified · dated 2026-03-15
  2. 20:40 · Gemini 3 Pro Preview · LiveCodeBench · 91.7% · NEW SOTA · +6.70 · source ↗ · verified · dated 2026-03-15
  3. 20:40 · Claude Opus 4.7 · SWE-Bench Verified · 87.6% · NEW SOTA · +6.70 · source ↗ · verified · dated 2026-04-18
  4. 18:58 · Qwen3.6 Plus · MMMU-Pro · 73.8% · -8.20 · source ↗ · verified · dated 2026-03-15
  5. 18:58 · GPT-5.1 · MMMU-Pro · 76.5% · -5.50 · source ↗ · verified · dated 2025-11-13
  6. 18:58 · Gemini 3 Pro · MMMU-Pro · 80.0% · -2.00 · source ↗ · verified · dated 2026-01-15
  7. 18:58 · GPT-5.2 · MMMU-Pro · 81.0% · -1.00 · source ↗ · verified · dated 2025-12-11
  8. 18:58 · Gemini 3.1 Pro Preview · MMMU-Pro · 82.0% · NEW SOTA · first result · source ↗ · verified · dated 2026-03-18
  9. 18:57 · Qwen3.5-27B · MMMU · 82.3% · -3.70 · source ↗ · verified · dated 2025-09-01
  10. 18:57 · Qwen3.5-122B-A10B · MMMU · 83.9% · -2.10 · source ↗ · verified · dated 2025-09-01
  11. 18:57 · Qwen3.5-397B-A17B · MMMU · 83.9% · -2.10 · source ↗ · verified · dated 2025-09-01
  12. 18:57 · GPT-5.1 · MMMU · 85.4% · -0.60 · source ↗ · verified · dated 2025-11-13
  13. 18:57 · GPT-5.1 Instant · MMMU · 85.4% · -0.60 · source ↗ · verified · dated 2025-11-13
  14. 18:57 · GPT-5.1 Thinking · MMMU · 85.4% · -0.60 · source ↗ · verified · dated 2025-11-13
  15. 18:57 · Qwen3.6 Plus · MMMU · 86.0% · NEW SOTA · +12.70 · source ↗ · verified · dated 2026-03-15
  16. 10:52 · GPT-4.5 · SWE-Bench · 62.0% · -20.10 · no source · unverified · dated 2025-06-01
  17. 10:52 · Claude Opus 4 · SWE-Bench · 55.2% · -26.90 · no source · unverified · dated 2025-03-01
  18. 10:52 · Claude 3.5 Sonnet · SWE-Bench · 49.0% · -33.10 · no source · unverified · dated 2024-12-01
  19. 10:52 · GPT-4o · SWE-Bench · 38.4% · -43.70 · no source · unverified · dated 2024-11-01
  20. 10:52 · Amazon Q Developer Agent · SWE-Bench · 36.2% · -45.90 · no source · unverified · dated 2024-10-01
  21. 10:52 · Claude 3.5 Sonnet · SWE-Bench · 27.0% · -55.10 · no source · unverified · dated 2024-08-01
  22. 10:52 · AutoCodeRover · SWE-Bench · 19.0% · -63.10 · no source · unverified · dated 2024-06-01
  23. 10:52 · Devin · SWE-Bench · 13.8% · -68.30 · no source · unverified · dated 2024-05-01
  24. 10:52 · GPT-4 · SWE-Bench · 12.5% · -69.60 · no source · unverified · dated 2024-03-01
  25. 10:52 · Claude 2 · SWE-Bench · 2.0% · -80.14 · no source · unverified · dated 2023-10-01
  26. 10:52 · Qwen 3 72B · SWE-Bench · 72.4% · -9.70 · no source · unverified · dated 2025-10-01
  27. 10:52 · DeepSeek V3.5 · SWE-Bench · 74.6% · -7.50 · no source · unverified · dated 2025-11-01
  28. 10:52 · Gemini 3 Pro · SWE-Bench · 76.2% · -5.90 · no source · unverified · dated 2025-12-01
  29. 10:52 · Gemini 3 Flash · SWE-Bench · 75.8% · -6.30 · no source · unverified · dated 2026-02-01
  30. 10:52 · Claude Opus 4.5 · SWE-Bench · 76.8% · -5.30 · no source · unverified · dated 2026-02-01
  31. 10:52 · Kimi K2.5 · SWE-Bench · 76.8% · -5.30 · no source · unverified · dated 2026-01-01
  32. 10:52 · Claude Sonnet 4.5 · SWE-Bench · 77.2% · -4.90 · no source · unverified · dated 2025-12-01
  33. 10:52 · Gemini 3 Pro · SWE-Bench · 77.4% · -4.70 · source ↗ · verified · dated 2026-01-01
  34. 10:52 · GLM-5 · SWE-Bench · 77.8% · -4.30 · source ↗ · verified · dated 2026-01-01
  35. 10:52 · Claude Opus 4.5 · SWE-Bench · 79.2% · -2.90 · source ↗ · verified · dated 2026-01-01
  36. 10:52 · Sonar Foundation · SWE-Bench · 79.2% · -2.90 · source ↗ · verified · dated 2026-01-01
  37. 10:52 · GPT-5.2 · SWE-Bench · 80.0% · -2.10 · source ↗ · verified · dated 2026-02-01
  38. 10:52 · MiniMax M2.5 · SWE-Bench · 80.2% · -1.90 · source ↗ · verified · dated 2026-01-01
  39. 10:52 · Claude Opus 4.6 · SWE-Bench · 80.8% · -1.30 · source ↗ · verified · dated 2026-02-01
  40. 10:52 · Claude Opus 4.5 · SWE-Bench · 80.9% · -1.20 · source ↗ · verified · dated 2026-02-01
  41. 10:52 · BERT + AoA · SQuAD v2.0 · 88.6% · -2.80 · source ↗ · verified · dated 2019-03-01
  42. 10:52 · BERT (Google AI) · SQuAD v2.0 · 83.1% · -8.30 · source ↗ · verified · dated 2018-11-01
  43. 10:52 · Logistic Regression (SQuAD baseline) · SQuAD v2.0 · 51.0% · -40.40 · source ↗ · verified · dated 2016-06-01
  44. 10:52 · SLQA+ (single model) · SQuAD v2.0 · 87.0% · -4.38 · source ↗ · verified · dated 2018-01-01
  45. 10:52 · Hanvon_model (single model) · SQuAD v2.0 · 87.1% · -4.28 · source ↗ · verified · dated 2019-09-01
  46. 10:52 · Insight-baseline-BERT (single model) · SQuAD v2.0 · 87.6% · -3.76 · source ↗ · verified · dated 2019-04-01
  47. 10:52 · XLNet (single, Verified XiaoPAI) · SQuAD v2.0 · 88.0% · -3.40 · source ↗ · verified · dated 2019-09-01
  48. 10:52 · SpanBERT (single model) · SQuAD v2.0 · 88.7% · -2.69 · source ↗ · verified · dated 2019-07-01
  49. 10:52 · BERT + DAE + AoA (single model) · SQuAD v2.0 · 88.6% · -2.78 · source ↗ · verified · dated 2019-03-01
  50. 10:52 · XLNet+Verifier (single, Ping An) · SQuAD v2.0 · 89.1% · -2.34 · source ↗ · verified · dated 2019-08-01
  51. 10:52 · XLNet+Verifier (single, Google/CMU) · SQuAD v2.0 · 89.1% · -2.32 · source ↗ · verified · dated 2019-10-01
  52. 10:52 · BERT + ConvLSTM + MTL + Verifier (ensemble) · SQuAD v2.0 · 89.3% · -2.11 · source ↗ · verified · dated 2019-03-01
  53. 10:52 · RoBERTa+Verify (single model) · SQuAD v2.0 · 89.6% · -1.81 · source ↗ · verified · dated 2019-11-01
  54. 10:52 · Enhanced Albert+Verifier3 (ensemble) · SQuAD v2.0 · 89.8% · -1.62 · source ↗ · verified · dated 2020-05-01
  55. 10:52 · RoBERTa (single model) · SQuAD v2.0 · 89.8% · -1.61 · source ↗ · verified · dated 2020-07-01
  56. 10:51 · ERNIE 5.0 · GSM8K · 99.7% · NEW SOTA · +0.50 · source ↗ · unverified · dated 2026-03-01
  57. 10:51 · GPT-5 · GSM8K · 99.2% · NEW SOTA · +0.20 · source ↗ · unverified · dated 2025-08-01
  58. 10:51 · o1 · GSM8K · 97.8% · -1.20 · source ↗ · unverified · dated 2024-09-01
  59. 10:51 · GPT-4 · GSM8K · 92.0% · -7.00 · source ↗ · unverified · dated 2023-03-01
  60. 10:51 · PaLM 540B (Self-Consistency) · GSM8K · 74.0% · -25.00 · source ↗ · unverified · dated 2022-01-01
  61. 10:51 · PaLM 540B (CoT) · GSM8K · 58.0% · -41.00 · source ↗ · unverified · dated 2022-01-01
  62. 10:51 · GPT-3 (base) · GSM8K · 8.0% · -91.00 · source ↗ · unverified · dated 2021-11-01
  63. 10:51 · Mixtral-8x22b · GSM8K · 88.0% · -11.00 · source ↗ · unverified · dated 2024-04-01
  64. 10:51 · Claude 3 Haiku · GSM8K · 88.9% · -10.10 · source ↗ · unverified · dated 2024-03-01
  65. 10:51 · GPT-4 · GSM8K · 92.0% · -7.00 · source ↗ · unverified · dated 2023-03-01
  66. 10:51 · Gemini Ultra · GSM8K · 94.4% · -4.60 · source ↗ · unverified · dated 2024-02-01
  67. 10:51 · Claude 3 Opus · GSM8K · 95.0% · -4.00 · source ↗ · unverified · dated 2024-03-01
  68. 10:51 · Claude 3.5 Sonnet · GSM8K · 95.0% · -4.00 · source ↗ · unverified · dated 2024-07-01
  69. 10:51 · o1 · GSM8K · 97.8% · -1.20 · source ↗ · unverified · dated 2024-09-01
  70. 10:51 · GPT-4.5 · GSM8K · 98.2% · -0.80 · source ↗ · unverified · dated 2025-03-01
  71. 10:51 · Llama 4 Behemoth 2T · GSM8K · 98.5% · -0.50 · source ↗ · unverified · dated 2025-04-01
  72. 10:51 · Claude 4 · GSM8K · 98.9% · -0.10 · source ↗ · unverified · dated 2025-05-01
  73. 10:51 · Claude Sonnet 5 · SWE-Bench · 82.1% · =0.0 · source ↗ · verified · dated 2026-02-01
  74. 10:51 · MiniMax M2.5 · SWE-Bench · 80.2% · -1.90 · source ↗ · verified · dated 2026-01-01
  75. 10:51 · Claude Opus 4.5 · SWE-Bench · 78.0% · -4.10 · source ↗ · verified · dated 2025-12-01
  76. 10:51 · Claude Sonnet 4.5 · SWE-Bench · 70.8% · -11.30 · source ↗ · verified · dated 2025-09-01
  77. 10:51 · GPT-4.5 · SWE-Bench · 62.0% · -20.10 · source ↗ · verified · dated 2025-06-01
  78. 10:51 · Claude Opus 4 · SWE-Bench · 55.2% · -26.90 · source ↗ · verified · dated 2025-03-01
  79. 10:51 · Claude 3.5 Sonnet v2 · SWE-Bench · 49.0% · -33.10 · source ↗ · verified · dated 2024-12-01
  80. 10:51 · o1-preview · SWE-Bench · 36.2% · -45.90 · source ↗ · verified · dated 2024-10-01
  81. 10:51 · Claude 3.5 Sonnet · SWE-Bench · 27.0% · -55.10 · source ↗ · verified · dated 2024-08-01
  82. 10:51 · GPT-4o · SWE-Bench · 19.0% · -63.10 · source ↗ · verified · dated 2024-06-01
  83. 10:51 · GPT-4 Turbo · SWE-Bench · 12.5% · -69.60 · source ↗ · verified · dated 2024-03-01
  84. 10:51 · Claude 2 · SWE-Bench · 2.0% · -80.14 · source ↗ · verified · dated 2023-10-01
  85. 10:51 · DeepSeek-Coder 33B · SWE-Bench · 15.6% · -66.50 · source ↗ · verified · dated 2024-06-01
  86. 10:51 · StarCoder2 15B · SWE-Bench · 18.3% · -63.80 · source ↗ · verified · dated 2024-10-01
  87. 10:51 · CodeLlama 70B · SWE-Bench · 29.8% · -52.30 · source ↗ · verified · dated 2024-12-01
  88. 10:51 · Qwen2.5-Coder 32B · SWE-Bench · 55.4% · -26.70 · source ↗ · verified · dated 2025-06-01
  89. 10:51 · DeepSeek-Coder V2.5 · SWE-Bench · 68.2% · -13.90 · source ↗ · verified · dated 2025-08-01
  90. 10:51 · Qwen3 72B · SWE-Bench · 72.4% · -9.70 · source ↗ · verified · dated 2025-10-01
  91. 10:51 · Step-3.5-Flash · SWE-Bench · 74.4% · -7.70 · source ↗ · verified · dated 2026-01-01
  92. 10:51 · DeepSeek V3.5 · SWE-Bench · 74.6% · -7.50 · source ↗ · verified · dated 2025-11-01
  93. 10:51 · Qwen3-Max-Thinking · SWE-Bench · 75.3% · -6.80 · source ↗ · verified · dated 2026-02-01
  94. 10:51 · Gemini 3 Flash · SWE-Bench · 75.8% · -6.30 · source ↗ · verified · dated 2026-02-01
  95. 10:51 · DeepSeek R1 · SWE-Bench · 76.3% · -5.80 · source ↗ · verified · dated 2025-12-01
  96. 10:51 · Kimi K2.5 · SWE-Bench · 76.8% · -5.30 · source ↗ · verified · dated 2026-01-01
  97. 10:51 · Claude Sonnet 4.5 · SWE-Bench · 77.2% · -4.90 · source ↗ · verified · dated 2025-12-01
  98. 10:51 · Gemini 3 Pro · SWE-Bench · 77.4% · -4.70 · source ↗ · verified · dated 2026-01-01
  99. 10:51 · GLM-5 · SWE-Bench · 77.8% · -4.30 · source ↗ · verified · dated 2026-01-01
  100. 10:51 · Claude Opus 4.6 · SWE-Bench · 79.8% · -2.30 · source ↗ · verified · dated 2026-02-01
  101. 10:51 · GPT-5.2 · SWE-Bench · 80.0% · -2.10 · source ↗ · verified · dated 2026-02-01
  102. 10:51 · MiniMax M2.5 · SWE-Bench · 80.2% · -1.90 · source ↗ · verified · dated 2026-01-01
  103. 10:51 · Claude Opus 4.5 · SWE-Bench · 80.9% · -1.20 · source ↗ · verified · dated 2026-02-01
  104. 10:51 · Claude Sonnet 5 · SWE-Bench · 82.1% · NEW SOTA · first result · source ↗ · verified · dated 2026-02-01
  105. 10:51 · Phi-4 14B · MMLU · 83.9% · -9.00 · source ↗ · unverified · dated 2025-08-01
  106. 10:51 · Qwen 3 14B · MMLU · 84.3% · -8.60 · source ↗ · unverified · dated 2025-11-01
  107. 10:51 · Kimi K2.5 · MMLU · 86.0% · -6.90 · source ↗ · unverified · dated 2025-12-01
  108. 10:51 · MiniMax M2.5 · MMLU · 86.5% · -6.40 · source ↗ · unverified · dated 2026-01-01
  109. 10:51 · Mistral Large 3 · MMLU · 87.1% · -5.80 · source ↗ · unverified · dated 2025-10-01
  110. 10:51 · Llama 4 405B · MMLU · 87.8% · -5.10 · source ↗ · unverified · dated 2025-09-01
  111. 10:51 · DeepSeek V3.5 · MMLU · 88.2% · -4.70 · source ↗ · unverified · dated 2025-10-01
  112. 10:51 · Qwen 3 72B · MMLU · 88.7% · -4.20 · source ↗ · unverified · dated 2025-11-01
  113. 10:51 · Gemini 3 Flash · MMLU · 89.6% · -3.30 · source ↗ · unverified · dated 2026-01-01
  114. 10:51 · Claude Sonnet 4.5 · MMLU · 90.4% · -2.50 · source ↗ · unverified · dated 2025-12-01
  115. 10:51 · GPT-5 · MMLU · 90.8% · -2.10 · source ↗ · unverified · dated 2025-09-01
  116. 10:51 · Claude Opus 4.6 · MMLU · 91.2% · -1.70 · source ↗ · unverified · dated 2026-03-01
  117. 10:51 · Gemini 3 Pro · MMLU · 91.4% · -1.50 · source ↗ · unverified · dated 2026-01-01
  118. 10:51 · Claude Opus 4.5 · MMLU · 91.8% · -1.10 · source ↗ · unverified · dated 2026-01-01
  119. 10:51 · GPT-5.2 · MMLU · 92.4% · -0.50 · source ↗ · unverified · dated 2026-02-01
  120. 10:51 · SENet · ImageNet · 97.8% · NEW SOTA · +1.32 · source ↗ · verified · dated 2017-01-01
  121. 10:51 · ResNet-152 · ImageNet · 96.4% · NEW SOTA · +3.13 · source ↗ · verified · dated 2015-01-01
  122. 10:51 · GoogLeNet · ImageNet · 93.3% · NEW SOTA · +2.30 · source ↗ · verified · dated 2014-01-01
  123. 10:51 · AlexNet · ImageNet · 83.6% · -7.40 · source ↗ · verified · dated 2012-01-01
  124. 10:51 · NEC-UIUC · ImageNet · 71.8% · -19.20 · source ↗ · verified · dated 2010-01-01
  125. 10:51 · convnext_base.fb_in22k_ft_in1k · ImageNet · 86.3% · -4.70 · source ↗ · verified · dated 2022-01-01
  126. 10:51 · swin_large.ms_in22k_ft_in1k · ImageNet · 86.3% · -4.67 · source ↗ · verified · dated 2021-03-01
  127. 10:51 · nextvit_large.bd_ssld_6m_in1k_384 · ImageNet · 86.5% · -4.46 · source ↗ · verified · dated 2022-11-01
  128. 10:51 · coatnet_2_rw_224.sw_in12k_ft_in1k · ImageNet · 86.6% · -4.42 · source ↗ · verified · dated 2022-09-01
  129. 10:51 · maxvit_base_tf_512.in1k · ImageNet · 86.6% · -4.40 · source ↗ · verified · dated 2023-04-01
  130. 10:51 · InternViT-6B (InternVL) · ImageNet · 88.2% · -2.80 · source ↗ · verified · dated 2024-06-01
  131. 10:51 · ViT-22B/14 · ImageNet · 89.5% · -1.49 · source ↗ · verified · dated 2023-02-01
  132. 10:51 · EVA-02 (ViT-L/14+) · ImageNet · 90.0% · -1.00 · source ↗ · verified · dated 2023-03-01
  133. 10:51 · SoViT-400M/14 · ImageNet · 90.3% · -0.70 · source ↗ · verified · dated 2023-05-01
  134. 10:51 · CoCa (ViT-G/14) · ImageNet · 91.0% · NEW SOTA · first result · source ↗ · verified · dated 2022-05-01
  135. 10:51 · DETR · COCO · 43.3% · -22.82 · source ↗ · unverified · dated 2020-05-26
  136. 10:51 · Mask R-CNN · COCO · 39.8% · -26.32 · source ↗ · unverified · dated 2017-03-20
  137. 10:51 · Faster R-CNN · COCO · 37.4% · -28.72 · source ↗ · unverified · dated 2015-06-04
  138. 10:51 · Swin-L (Cascade R-CNN) · COCO · 58.9% · -7.22 · source ↗ · unverified · dated 2021-07-01
  139. 10:51 · ViT-Adapter-L · COCO · 60.5% · -5.62 · source ↗ · unverified · dated 2022-11-01
  140. 10:51 · DINO-ViT-L · COCO · 63.3% · -2.82 · source ↗ · unverified · dated 2023-03-01
  141. 10:51 · InternImage-H (OneFormer) · COCO · 65.5% · -0.62 · source ↗ · unverified · dated 2024-03-01
  142. 10:51 · Thinker · COCO · 66.0% · -0.12 · source ↗ · unverified · dated 2024-08-01
  143. 10:51 · SenseTime Basemodel · COCO · 66.0% · -0.12 · source ↗ · unverified · dated 2024-11-01
  144. 10:51 · CW_Detection · COCO · 66.0% · -0.12 · source ↗ · unverified · dated 2025-01-01
  145. 10:51 · ScyllaNet · COCO · 66.1% · NEW SOTA · +0.12 · source ↗ · unverified · dated 2025-09-01
  146. 10:51 · T5-11B · GLUE · 89.3% · -2.00 · source ↗ · verified · dated 2019-10-01
  147. 10:51 · DeBERTa (ensemble) · GLUE · 90.3% · -1.00 · source ↗ · verified · dated 2021-01-01
  148. 10:51 · ERNIE 3.0 · GLUE · 90.6% · -0.70 · source ↗ · verified · dated 2021-07-01
  149. 10:51 · ST-MoE-32B · GLUE · 91.2% · -0.10 · source ↗ · verified · dated 2022-02-01
  150. 10:51 · Vega v2 (6B) · GLUE · 91.3% · NEW SOTA · first result · source ↗ · verified · dated 2022-10-01
  151. 10:51 · Qianfan-OCR · OmniDocBench · 91.0% · -6.48 · source ↗ · unverified
  152. 10:51 · Qianfan-OCR · OmniDocBench · 92.4% · -5.07 · source ↗ · unverified
  153. 10:51 · Qianfan-OCR · OmniDocBench · 0.0% · -97.46 · source ↗ · unverified
  154. 10:51 · Qianfan-OCR · OmniDocBench · 93.1% · -4.38 · source ↗ · unverified
  155. 10:51 · GPT-4o · OmniDocBench · 75.0% · -22.48 · source ↗ · unverified
  156. 10:51 · Dolphin-1.5 · OmniDocBench · 85.1% · -12.44 · source ↗ · unverified
  157. 10:51 · Dolphin-v2 · OmniDocBench · 89.8% · -7.72 · source ↗ · unverified
  158. 10:51 · clearOCR · OmniDocBench · 31.7% · -65.80 · source ↗ · verified
  159. 10:51 · PaddleOCR-VL · OmniDocBench · 92.9% · -4.64 · source ↗ · unverified
  160. 10:51 · mistral-ocr-2512 · OmniDocBench · 79.8% · -17.75 · source ↗ · verified
  161. 10:51 · Mistral OCR 3 · OmniDocBench · 79.8% · -17.75 · source ↗ · verified
  162. 10:51 · dots.ocr 3B · OmniDocBench · 88.4% · -9.09 · source ↗ · unverified
  163. 10:51 · OCRVerse 4B · OmniDocBench · 88.6% · -8.94 · source ↗ · unverified
  164. 10:51 · Qwen2.5-VL · OmniDocBench · 87.0% · -10.48 · source ↗ · unverified
  165. 10:51 · Gemini 2.5 Pro · OmniDocBench · 88.0% · -9.47 · source ↗ · unverified
  166. 10:51 · MonkeyOCR-pro-3B · OmniDocBench · 88.8% · -8.65 · source ↗ · unverified
  167. 10:51 · Qwen3-VL-235B · OmniDocBench · 89.2% · -8.35 · source ↗ · unverified
  168. 10:51 · MinerU 2.5 · OmniDocBench · 90.7% · -6.83 · source ↗ · unverified
  169. 10:51 · PaddleOCR-VL 0.9B · OmniDocBench · 92.6% · -4.94 · source ↗ · unverified
  170. 10:51 · Codex (davinci-002) · HumanEval · 46.9% · -50.40 · source ↗ · verified · dated 2021-07-01
  171. 10:51 · DeepSeek-Coder-33B-Instruct · HumanEval · 79.3% · -18.00 · source ↗ · verified · dated 2023-11-01
  172. 10:51 · Codestral 25.01 · HumanEval · 85.3% · -12.00 · source ↗ · verified · dated 2025-01-01
  173. 10:51 · GPT-4 Turbo · HumanEval · 86.6% · -10.70 · source ↗ · verified · dated 2023-11-01
  174. 10:51 · Llama-3.3-70B-Instruct · HumanEval · 88.4% · -8.90 · source ↗ · verified · dated 2024-12-01
  175. 10:51 · GPT-4o · HumanEval · 90.2% · -7.10 · source ↗ · verified · dated 2024-05-01
  176. 10:51 · DeepSeek-Coder-V2-Instruct · HumanEval · 90.2% · -7.10 · source ↗ · verified · dated 2024-06-01
  177. 10:51 · Qwen2.5-Coder 32B · HumanEval · 92.7% · -4.60 · source ↗ · verified · dated 2025-03-01
  178. 10:51 · Claude Sonnet 4.6 · HumanEval · 94.1% · -3.20 · source ↗ · verified · dated 2026-01-01
  179. 10:51 · o3 · HumanEval · 94.8% · -2.50 · source ↗ · verified · dated 2025-04-01
  180. 10:51 · GPT-5 · HumanEval · 95.1% · -2.20 · source ↗ · verified · dated 2025-12-01
  181. 10:51 · Claude Opus 4.6 · HumanEval · 96.3% · -1.00 · source ↗ · verified · dated 2026-01-01
2026-04-20 · 19 rows
  1. 14:23 · GPT-4o · MMLU-Pro · 72.6% · -18.39 · source ↗ · unverified · dated 2026-04-20
  2. 14:23 · Claude 3.7 Sonnet · MMLU-Pro · 85.1% · -5.89 · source ↗ · unverified · dated 2026-04-20
  3. 14:23 · DeepSeek-R1-0528 · MMLU-Pro · 85.0% · -5.99 · source ↗ · unverified · dated 2026-04-20
  4. 14:23 · Kimi K2-Thinking-0905 · MMLU-Pro · 84.6% · -6.39 · source ↗ · unverified · dated 2026-04-20
  5. 14:23 · GLM-4.5 · MMLU-Pro · 84.6% · -6.39 · source ↗ · unverified · dated 2026-04-20
  6. 14:23 · DeepSeek V3.2 · MMLU-Pro · 86.2% · -4.79 · source ↗ · unverified · dated 2026-04-20
  7. 14:23 · Grok 4 · MMLU-Pro · 86.6% · -4.39 · source ↗ · unverified · dated 2026-04-20
  8. 14:23 · GPT-5.1 · MMLU-Pro · 87.0% · -3.99 · source ↗ · unverified · dated 2026-04-20
  9. 14:23 · GPT-5 · MMLU-Pro · 87.1% · -3.89 · source ↗ · unverified · dated 2026-04-20
  10. 14:23 · Kimi K2.5 · MMLU-Pro · 87.1% · -3.89 · source ↗ · unverified · dated 2026-04-20
  11. 14:23 · GPT-5.2 · MMLU-Pro · 87.4% · -3.59 · source ↗ · unverified · dated 2026-04-20
  12. 14:23 · Claude Sonnet 4.5 · MMLU-Pro · 87.5% · -3.49 · source ↗ · unverified · dated 2026-04-20
  13. 14:23 · Qwen3.5-397B-A17B · MMLU-Pro · 87.8% · -3.19 · source ↗ · unverified · dated 2026-04-20
  14. 14:23 · MiniMax M2.1 · MMLU-Pro · 88.0% · -2.99 · source ↗ · unverified · dated 2026-04-20
  15. 14:23 · Claude Opus 4.1 · MMLU-Pro · 88.0% · -2.99 · source ↗ · unverified · dated 2026-04-20
  16. 14:23 · Qwen3.6 Plus · MMLU-Pro · 88.5% · -2.49 · source ↗ · unverified · dated 2026-04-20
  17. 14:23 · Gemini 3 Flash · MMLU-Pro · 89.0% · -1.99 · source ↗ · unverified · dated 2026-04-20
  18. 14:23 · Claude Opus 4.5 · MMLU-Pro · 89.5% · -1.49 · source ↗ · unverified · dated 2026-04-20
  19. 14:23 · Gemini 3 Pro · MMLU-Pro · 89.8% · -1.19 · source ↗ · unverified · dated 2026-04-20
Showing the 200 most-recent rows. To inspect a single dataset’s history, append ?dataset=ID (e.g. /log?dataset=mmmu). Delta compares each row to the prior-best value on the same dataset at the moment this row was added. Hidden datasets and hidden models are not shown.
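
The delta rule described above can be sketched in a few lines of Python. This is an illustration only, not Codesota's actual implementation: the names `LogRow`, `Annotated`, and `annotate_deltas` are hypothetical, rows are assumed to arrive in the order they were written, higher scores are assumed to be better on every dataset, and the sample values are made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LogRow:
    """Hypothetical minimal row: model, dataset, and score in percent."""
    model: str
    dataset: str
    value: float


@dataclass
class Annotated:
    row: LogRow
    delta: Optional[float]  # None means "first result" on this dataset
    new_sota: bool


def annotate_deltas(rows: List[LogRow]) -> List[Annotated]:
    """Compare each row to the prior-best value on the same dataset
    at the moment the row is added (assumes higher is better)."""
    best: Dict[str, float] = {}  # running best score per dataset
    out: List[Annotated] = []
    for row in rows:  # rows must be in insertion order
        prior = best.get(row.dataset)
        if prior is None:
            # No earlier row on this dataset: first result, trivially the best.
            out.append(Annotated(row, None, True))
            best[row.dataset] = row.value
            continue
        delta = round(row.value - prior, 2)
        new_sota = row.value > prior  # a tie is not a new best
        out.append(Annotated(row, delta, new_sota))
        if new_sota:
            best[row.dataset] = row.value
    return out


if __name__ == "__main__":
    # Illustrative values only, not taken from the registry.
    sample = [
        LogRow("model-a", "demo-bench", 85.0),
        LogRow("model-b", "demo-bench", 91.7),
        LogRow("model-c", "demo-bench", 90.8),
    ]
    for a in annotate_deltas(sample):
        print(a.row.model, a.delta, a.new_sota)
```

Under these assumptions, a row that matches the running best gets a delta of 0.0 and is not flagged, and a row that trails it gets a negative delta measured against the best value at the time it was added, not against the final best.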