Codesota · Registry log · 9,080 rows · 7,132 new this month · Showing 200
Editorial · Registry log
Every score we've added, in order.
The append-only public ledger of every benchmark result on Codesota. When a row was written, when the result itself is dated, who the model was, what value was claimed, and where the citation lives. New-SOTA rows are marked in colour; unverified rows still show, but labelled.
This is the audit trail. If a score is wrong, this is where the error will be visible; if a source is missing, this is where you'll see the gap.
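The fields named above (when the row was written, the date of the result, the model, the value claimed, and the citation) can be pictured as one immutable record per row. The following is a minimal sketch only: the class name `LogRow`, its field names, and the `status()` helper are hypothetical and chosen for illustration, not Codesota's actual schema.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)  # frozen: an append-only ledger never edits a row in place
class LogRow:
    written_at: datetime          # when the row was added to the log
    result_date: Optional[date]   # the "dated" field; some rows carry none
    model: str
    dataset: str
    value: float                  # score in percent
    source_url: Optional[str]     # None is shown as "no source"
    verified: bool

    def status(self) -> str:
        """Render the source/verification labels shown at the end of a row."""
        source = "source" if self.source_url else "no source"
        check = "verified" if self.verified else "unverified"
        return f"{source} · {check}"


# Example built from one row of this log (GPT-4.5 on SWE-Bench, written at 10:52).
example = LogRow(
    written_at=datetime(2026, 4, 23, 10, 52),
    result_date=date(2025, 6, 1),
    model="GPT-4.5",
    dataset="SWE-Bench",
    value=62.0,
    source_url=None,
    verified=False,
)
print(example.status())
```

Keeping `written_at` and `result_date` as separate fields mirrors the distinction the log draws between when a row was written and when the result itself is dated.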
2026-04-23 · 181 rows
- 20:40 · Gemini 3 Flash · LiveCodeBench · 90.8% · -0.90 · source ↗ · verified · dated 2026-03-15
- 20:40 · Gemini 3 Pro Preview · LiveCodeBench · 91.7% · NEW SOTA +6.70 · source ↗ · verified · dated 2026-03-15
- 20:40 · Claude Opus 4.7 · SWE-Bench Verified · 87.6% · NEW SOTA +6.70 · source ↗ · verified · dated 2026-04-18
- 18:58 · Qwen3.6 Plus · MMMU-Pro · 73.8% · -8.20 · source ↗ · verified · dated 2026-03-15
- 18:58 · GPT-5.1 · MMMU-Pro · 76.5% · -5.50 · source ↗ · verified · dated 2025-11-13
- 18:58 · Gemini 3 Pro · MMMU-Pro · 80.0% · -2.00 · source ↗ · verified · dated 2026-01-15
- 18:58 · GPT-5.2 · MMMU-Pro · 81.0% · -1.00 · source ↗ · verified · dated 2025-12-11
- 18:58 · Gemini 3.1 Pro Preview · MMMU-Pro · 82.0% · NEW SOTA first result · source ↗ · verified · dated 2026-03-18
- 18:57 · Qwen3.5-27B · MMMU · 82.3% · -3.70 · source ↗ · verified · dated 2025-09-01
- 18:57 · Qwen3.5-122B-A10B · MMMU · 83.9% · -2.10 · source ↗ · verified · dated 2025-09-01
- 18:57 · Qwen3.5-397B-A17B · MMMU · 83.9% · -2.10 · source ↗ · verified · dated 2025-09-01
- 18:57 · GPT-5.1 · MMMU · 85.4% · -0.60 · source ↗ · verified · dated 2025-11-13
- 18:57 · GPT-5.1 Instant · MMMU · 85.4% · -0.60 · source ↗ · verified · dated 2025-11-13
- 18:57 · GPT-5.1 Thinking · MMMU · 85.4% · -0.60 · source ↗ · verified · dated 2025-11-13
- 18:57 · Qwen3.6 Plus · MMMU · 86.0% · NEW SOTA +12.70 · source ↗ · verified · dated 2026-03-15
- 10:52 · GPT-4.5 · SWE-Bench · 62.0% · -20.10 · no source · unverified · dated 2025-06-01
- 10:52 · Claude Opus 4 · SWE-Bench · 55.2% · -26.90 · no source · unverified · dated 2025-03-01
- 10:52 · Claude 3.5 Sonnet · SWE-Bench · 49.0% · -33.10 · no source · unverified · dated 2024-12-01
- 10:52 · GPT-4o · SWE-Bench · 38.4% · -43.70 · no source · unverified · dated 2024-11-01
- 10:52 · Amazon Q Developer Agent · SWE-Bench · 36.2% · -45.90 · no source · unverified · dated 2024-10-01
- 10:52 · Claude 3.5 Sonnet · SWE-Bench · 27.0% · -55.10 · no source · unverified · dated 2024-08-01
- 10:52 · AutoCodeRover · SWE-Bench · 19.0% · -63.10 · no source · unverified · dated 2024-06-01
- 10:52 · Devin · SWE-Bench · 13.8% · -68.30 · no source · unverified · dated 2024-05-01
- 10:52 · GPT-4 · SWE-Bench · 12.5% · -69.60 · no source · unverified · dated 2024-03-01
- 10:52 · Claude 2 · SWE-Bench · 2.0% · -80.14 · no source · unverified · dated 2023-10-01
- 10:52 · Qwen 3 72B · SWE-Bench · 72.4% · -9.70 · no source · unverified · dated 2025-10-01
- 10:52 · DeepSeek V3.5 · SWE-Bench · 74.6% · -7.50 · no source · unverified · dated 2025-11-01
- 10:52 · Gemini 3 Pro · SWE-Bench · 76.2% · -5.90 · no source · unverified · dated 2025-12-01
- 10:52 · Gemini 3 Flash · SWE-Bench · 75.8% · -6.30 · no source · unverified · dated 2026-02-01
- 10:52 · Claude Opus 4.5 · SWE-Bench · 76.8% · -5.30 · no source · unverified · dated 2026-02-01
- 10:52 · Kimi K2.5 · SWE-Bench · 76.8% · -5.30 · no source · unverified · dated 2026-01-01
- 10:52 · Claude Sonnet 4.5 · SWE-Bench · 77.2% · -4.90 · no source · unverified · dated 2025-12-01
- 10:52 · Gemini 3 Pro · SWE-Bench · 77.4% · -4.70 · source ↗ · verified · dated 2026-01-01
- 10:52 · GLM-5 · SWE-Bench · 77.8% · -4.30 · source ↗ · verified · dated 2026-01-01
- 10:52 · Claude Opus 4.5 · SWE-Bench · 79.2% · -2.90 · source ↗ · verified · dated 2026-01-01
- 10:52 · Sonar Foundation · SWE-Bench · 79.2% · -2.90 · source ↗ · verified · dated 2026-01-01
- 10:52 · GPT-5.2 · SWE-Bench · 80.0% · -2.10 · source ↗ · verified · dated 2026-02-01
- 10:52 · MiniMax M2.5 · SWE-Bench · 80.2% · -1.90 · source ↗ · verified · dated 2026-01-01
- 10:52 · Claude Opus 4.6 · SWE-Bench · 80.8% · -1.30 · source ↗ · verified · dated 2026-02-01
- 10:52 · Claude Opus 4.5 · SWE-Bench · 80.9% · -1.20 · source ↗ · verified · dated 2026-02-01
- 10:52 · BERT + AoA · SQuAD v2.0 · 88.6% · -2.80 · source ↗ · verified · dated 2019-03-01
- 10:52 · BERT (Google AI) · SQuAD v2.0 · 83.1% · -8.30 · source ↗ · verified · dated 2018-11-01
- 10:52 · Logistic Regression (SQuAD baseline) · SQuAD v2.0 · 51.0% · -40.40 · source ↗ · verified · dated 2016-06-01
- 10:52 · SLQA+ (single model) · SQuAD v2.0 · 87.0% · -4.38 · source ↗ · verified · dated 2018-01-01
- 10:52 · Hanvon_model (single model) · SQuAD v2.0 · 87.1% · -4.28 · source ↗ · verified · dated 2019-09-01
- 10:52 · Insight-baseline-BERT (single model) · SQuAD v2.0 · 87.6% · -3.76 · source ↗ · verified · dated 2019-04-01
- 10:52 · XLNet (single, Verified XiaoPAI) · SQuAD v2.0 · 88.0% · -3.40 · source ↗ · verified · dated 2019-09-01
- 10:52 · SpanBERT (single model) · SQuAD v2.0 · 88.7% · -2.69 · source ↗ · verified · dated 2019-07-01
- 10:52 · BERT + DAE + AoA (single model) · SQuAD v2.0 · 88.6% · -2.78 · source ↗ · verified · dated 2019-03-01
- 10:52 · XLNet+Verifier (single, Ping An) · SQuAD v2.0 · 89.1% · -2.34 · source ↗ · verified · dated 2019-08-01
- 10:52 · XLNet+Verifier (single, Google/CMU) · SQuAD v2.0 · 89.1% · -2.32 · source ↗ · verified · dated 2019-10-01
- 10:52 · BERT + ConvLSTM + MTL + Verifier (ensemble) · SQuAD v2.0 · 89.3% · -2.11 · source ↗ · verified · dated 2019-03-01
- 10:52 · RoBERTa+Verify (single model) · SQuAD v2.0 · 89.6% · -1.81 · source ↗ · verified · dated 2019-11-01
- 10:52 · Enhanced Albert+Verifier3 (ensemble) · SQuAD v2.0 · 89.8% · -1.62 · source ↗ · verified · dated 2020-05-01
- 10:52 · RoBERTa (single model) · SQuAD v2.0 · 89.8% · -1.61 · source ↗ · verified · dated 2020-07-01
- 10:51 · ERNIE 5.0 · GSM8K · 99.7% · NEW SOTA +0.50 · source ↗ · unverified · dated 2026-03-01
- 10:51 · GPT-5 · GSM8K · 99.2% · NEW SOTA +0.20 · source ↗ · unverified · dated 2025-08-01
- 10:51 · o1 · GSM8K · 97.8% · -1.20 · source ↗ · unverified · dated 2024-09-01
- 10:51 · GPT-4 · GSM8K · 92.0% · -7.00 · source ↗ · unverified · dated 2023-03-01
- 10:51 · PaLM 540B (Self-Consistency) · GSM8K · 74.0% · -25.00 · source ↗ · unverified · dated 2022-01-01
- 10:51 · PaLM 540B (CoT) · GSM8K · 58.0% · -41.00 · source ↗ · unverified · dated 2022-01-01
- 10:51 · GPT-3 (base) · GSM8K · 8.0% · -91.00 · source ↗ · unverified · dated 2021-11-01
- 10:51 · Mixtral-8x22b · GSM8K · 88.0% · -11.00 · source ↗ · unverified · dated 2024-04-01
- 10:51 · Claude 3 Haiku · GSM8K · 88.9% · -10.10 · source ↗ · unverified · dated 2024-03-01
- 10:51 · GPT-4 · GSM8K · 92.0% · -7.00 · source ↗ · unverified · dated 2023-03-01
- 10:51 · Gemini Ultra · GSM8K · 94.4% · -4.60 · source ↗ · unverified · dated 2024-02-01
- 10:51 · Claude 3 Opus · GSM8K · 95.0% · -4.00 · source ↗ · unverified · dated 2024-03-01
- 10:51 · Claude 3.5 Sonnet · GSM8K · 95.0% · -4.00 · source ↗ · unverified · dated 2024-07-01
- 10:51 · o1 · GSM8K · 97.8% · -1.20 · source ↗ · unverified · dated 2024-09-01
- 10:51 · GPT-4.5 · GSM8K · 98.2% · -0.80 · source ↗ · unverified · dated 2025-03-01
- 10:51 · Llama 4 Behemoth 2T · GSM8K · 98.5% · -0.50 · source ↗ · unverified · dated 2025-04-01
- 10:51 · Claude 4 · GSM8K · 98.9% · -0.10 · source ↗ · unverified · dated 2025-05-01
- 10:51 · Claude Sonnet 5 · SWE-Bench · 82.1% · =0.0 · source ↗ · verified · dated 2026-02-01
- 10:51 · MiniMax M2.5 · SWE-Bench · 80.2% · -1.90 · source ↗ · verified · dated 2026-01-01
- 10:51 · Claude Opus 4.5 · SWE-Bench · 78.0% · -4.10 · source ↗ · verified · dated 2025-12-01
- 10:51 · Claude Sonnet 4.5 · SWE-Bench · 70.8% · -11.30 · source ↗ · verified · dated 2025-09-01
- 10:51 · GPT-4.5 · SWE-Bench · 62.0% · -20.10 · source ↗ · verified · dated 2025-06-01
- 10:51 · Claude Opus 4 · SWE-Bench · 55.2% · -26.90 · source ↗ · verified · dated 2025-03-01
- 10:51 · Claude 3.5 Sonnet v2 · SWE-Bench · 49.0% · -33.10 · source ↗ · verified · dated 2024-12-01
- 10:51 · o1-preview · SWE-Bench · 36.2% · -45.90 · source ↗ · verified · dated 2024-10-01
- 10:51 · Claude 3.5 Sonnet · SWE-Bench · 27.0% · -55.10 · source ↗ · verified · dated 2024-08-01
- 10:51 · GPT-4o · SWE-Bench · 19.0% · -63.10 · source ↗ · verified · dated 2024-06-01
- 10:51 · GPT-4 Turbo · SWE-Bench · 12.5% · -69.60 · source ↗ · verified · dated 2024-03-01
- 10:51 · Claude 2 · SWE-Bench · 2.0% · -80.14 · source ↗ · verified · dated 2023-10-01
- 10:51 · DeepSeek-Coder 33B · SWE-Bench · 15.6% · -66.50 · source ↗ · verified · dated 2024-06-01
- 10:51 · StarCoder2 15B · SWE-Bench · 18.3% · -63.80 · source ↗ · verified · dated 2024-10-01
- 10:51 · CodeLlama 70B · SWE-Bench · 29.8% · -52.30 · source ↗ · verified · dated 2024-12-01
- 10:51 · Qwen2.5-Coder 32B · SWE-Bench · 55.4% · -26.70 · source ↗ · verified · dated 2025-06-01
- 10:51 · DeepSeek-Coder V2.5 · SWE-Bench · 68.2% · -13.90 · source ↗ · verified · dated 2025-08-01
- 10:51 · Qwen3 72B · SWE-Bench · 72.4% · -9.70 · source ↗ · verified · dated 2025-10-01
- 10:51 · Step-3.5-Flash · SWE-Bench · 74.4% · -7.70 · source ↗ · verified · dated 2026-01-01
- 10:51 · DeepSeek V3.5 · SWE-Bench · 74.6% · -7.50 · source ↗ · verified · dated 2025-11-01
- 10:51 · Qwen3-Max-Thinking · SWE-Bench · 75.3% · -6.80 · source ↗ · verified · dated 2026-02-01
- 10:51 · Gemini 3 Flash · SWE-Bench · 75.8% · -6.30 · source ↗ · verified · dated 2026-02-01
- 10:51 · DeepSeek R1 · SWE-Bench · 76.3% · -5.80 · source ↗ · verified · dated 2025-12-01
- 10:51 · Kimi K2.5 · SWE-Bench · 76.8% · -5.30 · source ↗ · verified · dated 2026-01-01
- 10:51 · Claude Sonnet 4.5 · SWE-Bench · 77.2% · -4.90 · source ↗ · verified · dated 2025-12-01
- 10:51 · Gemini 3 Pro · SWE-Bench · 77.4% · -4.70 · source ↗ · verified · dated 2026-01-01
- 10:51 · GLM-5 · SWE-Bench · 77.8% · -4.30 · source ↗ · verified · dated 2026-01-01
- 10:51 · Claude Opus 4.6 · SWE-Bench · 79.8% · -2.30 · source ↗ · verified · dated 2026-02-01
- 10:51 · GPT-5.2 · SWE-Bench · 80.0% · -2.10 · source ↗ · verified · dated 2026-02-01
- 10:51 · MiniMax M2.5 · SWE-Bench · 80.2% · -1.90 · source ↗ · verified · dated 2026-01-01
- 10:51 · Claude Opus 4.5 · SWE-Bench · 80.9% · -1.20 · source ↗ · verified · dated 2026-02-01
- 10:51 · Claude Sonnet 5 · SWE-Bench · 82.1% · NEW SOTA first result · source ↗ · verified · dated 2026-02-01
- 10:51 · Phi-4 14B · MMLU · 83.9% · -9.00 · source ↗ · unverified · dated 2025-08-01
- 10:51 · Qwen 3 14B · MMLU · 84.3% · -8.60 · source ↗ · unverified · dated 2025-11-01
- 10:51 · Kimi K2.5 · MMLU · 86.0% · -6.90 · source ↗ · unverified · dated 2025-12-01
- 10:51 · MiniMax M2.5 · MMLU · 86.5% · -6.40 · source ↗ · unverified · dated 2026-01-01
- 10:51 · Mistral Large 3 · MMLU · 87.1% · -5.80 · source ↗ · unverified · dated 2025-10-01
- 10:51 · Llama 4 405B · MMLU · 87.8% · -5.10 · source ↗ · unverified · dated 2025-09-01
- 10:51 · DeepSeek V3.5 · MMLU · 88.2% · -4.70 · source ↗ · unverified · dated 2025-10-01
- 10:51 · Qwen 3 72B · MMLU · 88.7% · -4.20 · source ↗ · unverified · dated 2025-11-01
- 10:51 · Gemini 3 Flash · MMLU · 89.6% · -3.30 · source ↗ · unverified · dated 2026-01-01
- 10:51 · Claude Sonnet 4.5 · MMLU · 90.4% · -2.50 · source ↗ · unverified · dated 2025-12-01
- 10:51 · GPT-5 · MMLU · 90.8% · -2.10 · source ↗ · unverified · dated 2025-09-01
- 10:51 · Claude Opus 4.6 · MMLU · 91.2% · -1.70 · source ↗ · unverified · dated 2026-03-01
- 10:51 · Gemini 3 Pro · MMLU · 91.4% · -1.50 · source ↗ · unverified · dated 2026-01-01
- 10:51 · Claude Opus 4.5 · MMLU · 91.8% · -1.10 · source ↗ · unverified · dated 2026-01-01
- 10:51 · GPT-5.2 · MMLU · 92.4% · -0.50 · source ↗ · unverified · dated 2026-02-01
- 10:51 · SENet · ImageNet · 97.8% · NEW SOTA +1.32 · source ↗ · verified · dated 2017-01-01
- 10:51 · ResNet-152 · ImageNet · 96.4% · NEW SOTA +3.13 · source ↗ · verified · dated 2015-01-01
- 10:51 · GoogLeNet · ImageNet · 93.3% · NEW SOTA +2.30 · source ↗ · verified · dated 2014-01-01
- 10:51 · AlexNet · ImageNet · 83.6% · -7.40 · source ↗ · verified · dated 2012-01-01
- 10:51 · NEC-UIUC · ImageNet · 71.8% · -19.20 · source ↗ · verified · dated 2010-01-01
- 10:51 · convnext_base.fb_in22k_ft_in1k · ImageNet · 86.3% · -4.70 · source ↗ · verified · dated 2022-01-01
- 10:51 · swin_large.ms_in22k_ft_in1k · ImageNet · 86.3% · -4.67 · source ↗ · verified · dated 2021-03-01
- 10:51 · nextvit_large.bd_ssld_6m_in1k_384 · ImageNet · 86.5% · -4.46 · source ↗ · verified · dated 2022-11-01
- 10:51 · coatnet_2_rw_224.sw_in12k_ft_in1k · ImageNet · 86.6% · -4.42 · source ↗ · verified · dated 2022-09-01
- 10:51 · maxvit_base_tf_512.in1k · ImageNet · 86.6% · -4.40 · source ↗ · verified · dated 2023-04-01
- 10:51 · InternViT-6B (InternVL) · ImageNet · 88.2% · -2.80 · source ↗ · verified · dated 2024-06-01
- 10:51 · ViT-22B/14 · ImageNet · 89.5% · -1.49 · source ↗ · verified · dated 2023-02-01
- 10:51 · EVA-02 (ViT-L/14+) · ImageNet · 90.0% · -1.00 · source ↗ · verified · dated 2023-03-01
- 10:51 · SoViT-400M/14 · ImageNet · 90.3% · -0.70 · source ↗ · verified · dated 2023-05-01
- 10:51 · CoCa (ViT-G/14) · ImageNet · 91.0% · NEW SOTA first result · source ↗ · verified · dated 2022-05-01
- 10:51 · DETR · COCO · 43.3% · -22.82 · source ↗ · unverified · dated 2020-05-26
- 10:51 · Mask R-CNN · COCO · 39.8% · -26.32 · source ↗ · unverified · dated 2017-03-20
- 10:51 · Faster R-CNN · COCO · 37.4% · -28.72 · source ↗ · unverified · dated 2015-06-04
- 10:51 · Swin-L (Cascade R-CNN) · COCO · 58.9% · -7.22 · source ↗ · unverified · dated 2021-07-01
- 10:51 · ViT-Adapter-L · COCO · 60.5% · -5.62 · source ↗ · unverified · dated 2022-11-01
- 10:51 · DINO-ViT-L · COCO · 63.3% · -2.82 · source ↗ · unverified · dated 2023-03-01
- 10:51 · InternImage-H (OneFormer) · COCO · 65.5% · -0.62 · source ↗ · unverified · dated 2024-03-01
- 10:51 · Thinker · COCO · 66.0% · -0.12 · source ↗ · unverified · dated 2024-08-01
- 10:51 · SenseTime Basemodel · COCO · 66.0% · -0.12 · source ↗ · unverified · dated 2024-11-01
- 10:51 · CW_Detection · COCO · 66.0% · -0.12 · source ↗ · unverified · dated 2025-01-01
- 10:51 · ScyllaNet · COCO · 66.1% · NEW SOTA +0.12 · source ↗ · unverified · dated 2025-09-01
- 10:51 · T5-11B · GLUE · 89.3% · -2.00 · source ↗ · verified · dated 2019-10-01
- 10:51 · DeBERTa (ensemble) · GLUE · 90.3% · -1.00 · source ↗ · verified · dated 2021-01-01
- 10:51 · ERNIE 3.0 · GLUE · 90.6% · -0.70 · source ↗ · verified · dated 2021-07-01
- 10:51 · ST-MoE-32B · GLUE · 91.2% · -0.10 · source ↗ · verified · dated 2022-02-01
- 10:51 · Vega v2 (6B) · GLUE · 91.3% · NEW SOTA first result · source ↗ · verified · dated 2022-10-01
- 10:51 · Qianfan-OCR · OmniDocBench · 91.0% · -6.48 · source ↗ · unverified
- 10:51 · Qianfan-OCR · OmniDocBench · 92.4% · -5.07 · source ↗ · unverified
- 10:51 · Qianfan-OCR · OmniDocBench · 0.0% · -97.46 · source ↗ · unverified
- 10:51 · Qianfan-OCR · OmniDocBench · 93.1% · -4.38 · source ↗ · unverified
- 10:51 · GPT-4o · OmniDocBench · 75.0% · -22.48 · source ↗ · unverified
- 10:51 · Dolphin-1.5 · OmniDocBench · 85.1% · -12.44 · source ↗ · unverified
- 10:51 · Dolphin-v2 · OmniDocBench · 89.8% · -7.72 · source ↗ · unverified
- 10:51 · clearOCR · OmniDocBench · 31.7% · -65.80 · source ↗ · verified
- 10:51 · PaddleOCR-VL · OmniDocBench · 92.9% · -4.64 · source ↗ · unverified
- 10:51 · mistral-ocr-2512 · OmniDocBench · 79.8% · -17.75 · source ↗ · verified
- 10:51 · Mistral OCR 3 · OmniDocBench · 79.8% · -17.75 · source ↗ · verified
- 10:51 · dots.ocr 3B · OmniDocBench · 88.4% · -9.09 · source ↗ · unverified
- 10:51 · OCRVerse 4B · OmniDocBench · 88.6% · -8.94 · source ↗ · unverified
- 10:51 · Qwen2.5-VL · OmniDocBench · 87.0% · -10.48 · source ↗ · unverified
- 10:51 · Gemini 2.5 Pro · OmniDocBench · 88.0% · -9.47 · source ↗ · unverified
- 10:51 · MonkeyOCR-pro-3B · OmniDocBench · 88.8% · -8.65 · source ↗ · unverified
- 10:51 · Qwen3-VL-235B · OmniDocBench · 89.2% · -8.35 · source ↗ · unverified
- 10:51 · MinerU 2.5 · OmniDocBench · 90.7% · -6.83 · source ↗ · unverified
- 10:51 · PaddleOCR-VL 0.9B · OmniDocBench · 92.6% · -4.94 · source ↗ · unverified
- 10:51 · Codex (davinci-002) · HumanEval · 46.9% · -50.40 · source ↗ · verified · dated 2021-07-01
- 10:51 · DeepSeek-Coder-33B-Instruct · HumanEval · 79.3% · -18.00 · source ↗ · verified · dated 2023-11-01
- 10:51 · Codestral 25.01 · HumanEval · 85.3% · -12.00 · source ↗ · verified · dated 2025-01-01
- 10:51 · GPT-4 Turbo · HumanEval · 86.6% · -10.70 · source ↗ · verified · dated 2023-11-01
- 10:51 · Llama-3.3-70B-Instruct · HumanEval · 88.4% · -8.90 · source ↗ · verified · dated 2024-12-01
- 10:51 · GPT-4o · HumanEval · 90.2% · -7.10 · source ↗ · verified · dated 2024-05-01
- 10:51 · DeepSeek-Coder-V2-Instruct · HumanEval · 90.2% · -7.10 · source ↗ · verified · dated 2024-06-01
- 10:51 · Qwen2.5-Coder 32B · HumanEval · 92.7% · -4.60 · source ↗ · verified · dated 2025-03-01
- 10:51 · Claude Sonnet 4.6 · HumanEval · 94.1% · -3.20 · source ↗ · verified · dated 2026-01-01
- 10:51 · o3 · HumanEval · 94.8% · -2.50 · source ↗ · verified · dated 2025-04-01
- 10:51 · GPT-5 · HumanEval · 95.1% · -2.20 · source ↗ · verified · dated 2025-12-01
- 10:51 · Claude Opus 4.6 · HumanEval · 96.3% · -1.00 · source ↗ · verified · dated 2026-01-01
2026-04-20 · 19 rows
- 14:23 · GPT-4o · MMLU-Pro · 72.6% · -18.39 · source ↗ · unverified · dated 2026-04-20
- 14:23 · Claude 3.7 Sonnet · MMLU-Pro · 85.1% · -5.89 · source ↗ · unverified · dated 2026-04-20
- 14:23 · DeepSeek-R1-0528 · MMLU-Pro · 85.0% · -5.99 · source ↗ · unverified · dated 2026-04-20
- 14:23 · Kimi K2-Thinking-0905 · MMLU-Pro · 84.6% · -6.39 · source ↗ · unverified · dated 2026-04-20
- 14:23 · GLM-4.5 · MMLU-Pro · 84.6% · -6.39 · source ↗ · unverified · dated 2026-04-20
- 14:23 · DeepSeek V3.2 · MMLU-Pro · 86.2% · -4.79 · source ↗ · unverified · dated 2026-04-20
- 14:23 · Grok 4 · MMLU-Pro · 86.6% · -4.39 · source ↗ · unverified · dated 2026-04-20
- 14:23 · GPT-5.1 · MMLU-Pro · 87.0% · -3.99 · source ↗ · unverified · dated 2026-04-20
- 14:23 · GPT-5 · MMLU-Pro · 87.1% · -3.89 · source ↗ · unverified · dated 2026-04-20
- 14:23 · Kimi K2.5 · MMLU-Pro · 87.1% · -3.89 · source ↗ · unverified · dated 2026-04-20
- 14:23 · GPT-5.2 · MMLU-Pro · 87.4% · -3.59 · source ↗ · unverified · dated 2026-04-20
- 14:23 · Claude Sonnet 4.5 · MMLU-Pro · 87.5% · -3.49 · source ↗ · unverified · dated 2026-04-20
- 14:23 · Qwen3.5-397B-A17B · MMLU-Pro · 87.8% · -3.19 · source ↗ · unverified · dated 2026-04-20
- 14:23 · MiniMax M2.1 · MMLU-Pro · 88.0% · -2.99 · source ↗ · unverified · dated 2026-04-20
- 14:23 · Claude Opus 4.1 · MMLU-Pro · 88.0% · -2.99 · source ↗ · unverified · dated 2026-04-20
- 14:23 · Qwen3.6 Plus · MMLU-Pro · 88.5% · -2.49 · source ↗ · unverified · dated 2026-04-20
- 14:23 · Gemini 3 Flash · MMLU-Pro · 89.0% · -1.99 · source ↗ · unverified · dated 2026-04-20
- 14:23 · Claude Opus 4.5 · MMLU-Pro · 89.5% · -1.49 · source ↗ · unverified · dated 2026-04-20
- 14:23 · Gemini 3 Pro · MMLU-Pro · 89.8% · -1.19 · source ↗ · unverified · dated 2026-04-20
Showing the 200 most-recent rows. To inspect a single dataset’s history, append ?dataset=ID (e.g. /log?dataset=mmmu). Delta compares each row to the prior-best value on the same dataset at the moment this row was added. Hidden datasets and hidden models are not shown.
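The delta rule and the dataset filter described above can be sketched in a few lines. This is an illustration under stated assumptions, not the site's implementation: the function names `log_url` and `annotate_deltas` are hypothetical, rows are assumed to arrive oldest first (the reverse of the display order), and the rounding to two decimals is inferred from how deltas appear in the rows.

```python
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlencode

Row = Tuple[str, str, float]  # (dataset, model, value in percent)


def log_url(dataset_id: str, base: str = "/log") -> str:
    """Build the per-dataset history URL, e.g. /log?dataset=mmmu."""
    return f"{base}?{urlencode({'dataset': dataset_id})}"


def annotate_deltas(rows: Iterable[Row]) -> List[Tuple[str, Optional[float], bool]]:
    """For each row, in insertion order, return (model, delta, is_new_sota).

    delta is the row's value minus the prior-best value on the same dataset
    at the moment the row is added; it is None for a dataset's first result.
    """
    best: Dict[str, float] = {}  # running best per dataset
    out: List[Tuple[str, Optional[float], bool]] = []
    for dataset, model, value in rows:
        prior = best.get(dataset)
        delta = None if prior is None else round(value - prior, 2)
        is_sota = prior is None or value > prior  # a tie is not a new SOTA
        if is_sota:
            best[dataset] = value
        out.append((model, delta, is_sota))
    return out


# The five GLUE rows from this log, oldest insertion first.
glue = [
    ("GLUE", "Vega v2 (6B)", 91.3),
    ("GLUE", "ST-MoE-32B", 91.2),
    ("GLUE", "ERNIE 3.0", 90.6),
    ("GLUE", "DeBERTa (ensemble)", 90.3),
    ("GLUE", "T5-11B", 89.3),
]
for model, delta, is_sota in annotate_deltas(glue):
    print(model, delta, "NEW SOTA" if is_sota else "")
```

Because the comparison uses the best value known when a row is written, the order of insertion matters: the same score can be a first result in one ordering and a negative delta in another.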