The state of the art,
measured honestly.
Codesota is the open registry ML engineers consult before choosing a model — benchmarks linked to code, scores cross-checked against the paper, and original analysis of how the market actually uses these models. A calmer, stricter successor to Papers with Code.
No paywall. No signup. No sponsored leaderboards. Every result carries its source type — reproduced, paper, or vendor-reported — so you can decide how much to believe each number.
Current state of the art.
A cross-section of the registry — the leading published score on each canonical benchmark, grouped by area. Shaded rows are independently verified by Codesota; unshaded rows cite the paper or vendor.
- 988 results
- 163 models tracked
- 97 datasets indexed
- 17 research areas
| Area | Benchmark | Leading model | Metric | Score | Results |
|---|---|---|---|---|---|
| Code | HumanEval | o4-mini (high) | pass@1 | 99.3% | 33 |
| Code | SWE-bench Verified | Claude Opus 4.7 | resolve rate | 87.6% | 39 |
| Code | LiveCodeBench | DeepSeek-R1-0528 | pass@1 | 73.3% | 22 |
| Reasoning | MMLU-Pro | — | accuracy | — | 0 |
| Reasoning | GPQA Diamond | o3 | accuracy | 82.8% | 17 |
| Reasoning | Humanity's Last Exam | — | accuracy | — | 0 |
| Math | MATH | o4-mini (high) | accuracy | 98.2% | 29 |
| Math | AIME 2024 | o1-preview | accuracy | 83.3% | 3 |
| Math | GSM8K | o1-preview | accuracy | 97.8% | 5 |
| Vision | ImageNet-1K | coca-finetuned | top-1 | 91.0% | 22 |
| Vision | COCO detection | co-detr-swin-l | mAP | 66.0% | 17 |
| Vision | ADE20K | ONE-PEACE | mIoU | 63.0% | 13 |
| VQA | VQA-v2 | Qwen2-VL 72B | accuracy | 87.6% | 23 |
| VQA | TextVQA | Qwen2.5-VL 72B | accuracy | 85.5% | 9 |
| OCR | OCRBench v2 | Qwen2.5-VL-72B | overall | 63.70 | 74 |
| OCR | OmniDocBench | mineru-2.5 | layout mAP | 97.5% | 47 |
| OCR | ParseBench | LlamaParse Agentic | accuracy | 84.9% | 14 |
| OCR | OCR · CER | mistral-ocr-3 | CER (lower is better) | 3.7 | 1 |
| Speech | WildASR | Gemini 3 Pro | WER (lower is better) | 2.8 | 14 |
| Speech | VoiceBench | Ultravox-GLM-4P7 | overall | 88.9% | 13 |
| Audio | ESC-50 | BEATs | accuracy | 98.1% | 4 |
| Embeddings | MTEB | NV-Embed-v2 | avg | 72.3% | 6 |
The frontier climbs.
HumanEval — the oldest public code-generation benchmark — is nearing saturation. The step chart shows each successive SOTA-setting submission in the registry; the current leader is a reasoning-augmented mini model, not a frontier flagship.
Below, small multiples plot the real SOTA envelope across seven modalities. Every point is a dated benchmark result from the registry; each step up marks a submission that beat the running best. The x-axis is calendar time.
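The envelope described above is a running maximum over date-sorted results. A minimal sketch of how it could be computed — the field names (`date`, `score`, `model`) and the sample data are illustrative, not the registry's actual schema:

```python
from datetime import date

def sota_envelope(results):
    """Return the SOTA-setting submissions: each dated result that beat
    the running best score up to that point."""
    steps = []
    best = float("-inf")
    # Walk results in chronological order; keep only strict improvements.
    for r in sorted(results, key=lambda r: r["date"]):
        if r["score"] > best:
            best = r["score"]
            steps.append(r)
    return steps

# Hypothetical example rows, not real registry data:
results = [
    {"model": "a", "date": date(2021, 7, 1), "score": 71.2},
    {"model": "b", "date": date(2022, 3, 1), "score": 67.0},  # below the envelope
    {"model": "c", "date": date(2023, 1, 15), "score": 88.4},
]
print([r["model"] for r in sota_envelope(results)])  # → ['a', 'c']
```

Plotting these steps against calendar time yields exactly the staircase shown in the small multiples.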
Sixteen domains. One registry.
Everyone tracks frontier LLM scores. We also track what your pipeline depends on — OCR, ASR, detection, retrieval, embedded inference — with the same standard of evidence.
Frontier models on MMLU, GPQA, MATH, AIME.
HumanEval, SWE-bench Verified, LiveCodeBench.
Long-horizon autonomy, tool use, OpenRouter flow.
Layout, handwriting, table extraction.
WER on WildASR and industry splits.
Voice clarity, fingerprint robustness.
ImageNet, CIFAR, linear probe.
COCO, LVIS zero-shot, detection.
VQA-v2, TextVQA, chart reasoning.
MTEB avg, BEIR, hybrid retrieval.
ESC-50, AudioSet, sound event detection.
CheXpert, MIMIC-CXR, MedQA.
Habitat, LIBERO-Long, manipulation.
MVTec-AD, DAGM, NEU-DET.
Speed, cost and energy on real silicon.
LLMs on Hailo-10H, edge chip catalog.
What we have been writing.
Sections are written like issues: a leaderboard at the top, the methodology below it, then essays that explain what the numbers mean — not just what they are.
LLM reasoning
Frontier LLMs across MMLU, GPQA, MATH and AIME with verification notes.
OCR & documents
Layout, handwriting, structured extraction — with ParseBench write-up.
Speech
ASR accuracy, voice fingerprints, and a deep dive on TTS robustness.
Vision
Classification, detection, segmentation — priced against hardware.
Agentic AI
Long-horizon autonomy, tool use, OpenRouter market trends.
Every ML task
The alphabetical register with trust grades for every canonical benchmark.
Voice fingerprints
Why TTS benchmarks miss the acoustic fingerprint that actually matters.
Papers with Code
The successor project to the archived Meta registry.
Choosing a TTS model
A practitioner guide to speech synthesis trade-offs.
Trained something that beats the table?
Submit a checkpoint or a paper result. We verify open-weight models against the public benchmark, cross-check vendor-reported numbers against the source, and add the row to the registry with its date and code trail.
If Codesota informed your research, please cite the registry. A citation helps other readers find reproduced, dated numbers — and helps us keep independent benchmarks sustainable.
@misc{codesota2026,
  title = {Codesota: The Open Registry of State-of-the-Art Machine Learning},
  author = {Wikiel, Kacper},
  year = {2026},
  url = {https://codesota.com},
  note = {Accessed: 2026-04-25}
}

@misc{codesota-registry2026,
  title = {Codesota Benchmark Registry},
  author = {Wikiel, Kacper},
  year = {2026},
  howpublished = {\url{https://codesota.com/data/benchmarks.json}},
  note = {Open JSON registry of reproduced benchmark results}
}
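Since every result carries a source type (reproduced, paper, or vendor-reported), the JSON registry can be filtered by trust level. A minimal sketch — the URL comes from the citation above, but the field names (`source_type`, `benchmark`, `score`) are assumptions, not the published schema:

```python
import json
from urllib.request import urlopen

REGISTRY_URL = "https://codesota.com/data/benchmarks.json"

def reproduced_results(registry):
    """Keep only rows Codesota independently reproduced."""
    return [r for r in registry if r.get("source_type") == "reproduced"]

# To fetch the live registry (field names may differ):
# registry = json.load(urlopen(REGISTRY_URL))

# Hypothetical sample rows for illustration:
sample = [
    {"benchmark": "HumanEval", "score": 99.3, "source_type": "reproduced"},
    {"benchmark": "GSM8K", "score": 97.8, "source_type": "paper"},
]
print([r["benchmark"] for r in reproduced_results(sample)])  # → ['HumanEval']
```

The same filter with `"paper"` or `"vendor"` recovers the other two evidence tiers.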