Find the current best AI model for a task, with evidence attached.

Benchmark, metric, source, date, and provenance come with every result. CodeSOTA maps models, papers, datasets, code, and scores into one dated, inspectable registry. Search it like a research index, or call /api/sota from an agent, notebook, or dashboard.
No paywall for reading or API access; signup is needed only to edit results, submit evidence, or give feedback. No sponsored leaderboards. Each result carries a source, snapshot date, metric direction, and provenance trail, so a benchmark claim can be inspected before it is reused.
Search the registry before you trust a leaderboard.
Query models, tasks, benchmarks, papers, and datasets in one place. The answer should point to typed registry objects, not an unsourced marketing table.
One representative result per capability area.
A compact snapshot of the nine capability areas: one canonical benchmark, one leading published score, and the source trail needed to inspect it. This is the human view of the same registry exposed through /api/sota.
- Results: 9,102
- Models tracked: 163
- Datasets indexed: 371
- Capability areas: 9
| Capability | Benchmark | Leading model | Metric | Score | Source | Snapshot |
|---|---|---|---|---|---|---|
| Language & Knowledge | MMLU-Pro | No trusted pick yet | accuracy | Pending | pending source | 2026-04-27 |
| Vision & Documents | OCRBench v2 | Qwen2.5-VL-72B | overall | 63.70 | codesota-api | 2026-04-20 |
| Audio & Speech | WildASR | Gemini 3 Pro | WER (lower) | 2.8 | codesota-api | 2026-04-20 |
| Multimodal Media | VQA-v2 | Qwen2-VL 72B | accuracy | 87.6% | codesota-api | 2026-04-20 |
| Code & Software Engineering | SWE-bench Verified | Claude Opus 4.7 | resolve rate | 87.6% | vendor | 2026-04-23 |
| Agents & Tool Use | GAIA | No trusted pick yet | accuracy | Pending | pending source | 2026-04-27 |
| Structured Data & Forecasting | MTEB | NV-Embed-v2 | avg | 72.31 | codesota-api | 2026-04-20 |
| Robotics, Control & RL | Atari 2600 | go-explore | human-normalized score | 40,000 | codesota-api | 2026-04-20 |
| Science, Medicine & Industry | MVTec-AD | SimpleNet | score | 99.60 | codesota-api | 2026-04-20 |
```
curl https://www.codesota.com/api/sota/swe-bench
```

Nine capability areas. Hundreds of task pages.
The public taxonomy stays small enough to understand, while each capability opens into benchmark pages, datasets, models, papers, and evidence rows.
- Language & Knowledge: reasoning, exams, retrieval, and knowledge-heavy language tasks.
- Vision & Documents: images, detection, OCR, layout, tables, and document parsing.
- Audio & Speech: ASR, audio tagging, voice assistants, speech quality, and TTS.
- Multimodal Media: VQA, charts, video, image-text reasoning, and media understanding.
- Code & Software Engineering: code generation, repair, repository tasks, and verified software work.
- Agents & Tool Use: long-horizon tool use, browser work, OS tasks, and workflow execution.
- Structured Data & Forecasting: embeddings, retrieval, reranking, tabular prediction, graphs, and forecasting.
- Robotics, Control & RL: simulation, control, games, embodied agents, and manipulation.
- Science, Medicine & Industry: scientific QA, medical imaging, industrial inspection, and applied AI.
A leaderboard row is not a fact until it can be inspected.
CodeSOTA is useful only if the evidence is visible, so the homepage surfaces the provenance contract ahead of editorial lineages and release notes:
- Dated scores: rows carry access dates and snapshot context, so old frontier claims do not masquerade as current facts.
- Metric direction: every benchmark declares whether higher or lower is better before a winner is selected (a minimal sketch follows this list).
- Source tiers: paper, vendor, reproduced, and registry-maintained rows are labeled separately.
- Provenance trail: benchmark pages connect model, paper, dataset, code, and source URL where available.
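The direction rule is mechanical enough to show in code. Below is a minimal sketch of how a client might select a leader from rows that declare their metric direction; the "higher" and "lower" values mirror the /api/sota response shape shown further down, while `pick_leader` and the sample rows are hypothetical illustrations, not part of the API.

```python
# Minimal sketch: select a leading row while respecting metric direction.
# The "higher" / "lower" values mirror the /api/sota example response below;
# pick_leader and the sample rows are illustrative, not part of the API.

def pick_leader(rows: list[dict], direction: str) -> dict:
    """Return the best row for a metric where higher or lower wins."""
    if direction not in ("higher", "lower"):
        raise ValueError(f"unknown metric direction: {direction!r}")
    best = max if direction == "higher" else min
    return best(rows, key=lambda row: row["score"])

# Accuracy-style benchmark: higher wins.
print(pick_leader([{"model": "a", "score": 87.6},
                   {"model": "b", "score": 85.1}], "higher")["model"])  # a

# WER-style benchmark: lower wins.
print(pick_leader([{"model": "c", "score": 2.8},
                   {"model": "d", "score": 3.4}], "lower")["model"])    # c
```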
The registry is callable.
Agents and notebooks should not scrape leaderboards. They should call a stable, source-aware endpoint and cache the snapshot they used; a sketch of that pattern follows the example response below.
API docs:

```
curl https://www.codesota.com/api/sota/swe-bench
curl "https://www.codesota.com/api/sota?area=vision-documents"
```

```json
{
  "task": "swe-bench",
  "metric": "resolve rate",
  "direction": "higher",
  "leader": {
    "model": "registry top pick",
    "score": "dated value",
    "source": "paper | vendor | reproduced",
    "snapshot_id": "2026-04-27"
  }
}
```
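For an agent or notebook that prefers not to shell out to curl, here is a minimal Python sketch under the same assumptions: the endpoint and response shape are exactly the example above, and the local cache directory is a hypothetical convention, not something the API prescribes.

```python
# Minimal sketch: fetch one task's SOTA record and cache it by snapshot_id,
# so downstream code can cite the exact snapshot it used. The cache layout
# (.codesota_cache/) is a hypothetical convention, not part of the API.
import json
import pathlib
import urllib.request

def fetch_sota(task: str) -> dict:
    with urllib.request.urlopen(
        f"https://www.codesota.com/api/sota/{task}", timeout=30
    ) as resp:
        return json.load(resp)

def cached_sota(task: str, cache_dir: str = ".codesota_cache") -> dict:
    record = fetch_sota(task)
    snapshot = record["leader"]["snapshot_id"]  # field per the example above
    out = pathlib.Path(cache_dir)
    out.mkdir(exist_ok=True)
    (out / f"{task}-{snapshot}.json").write_text(json.dumps(record, indent=2))
    return record

record = cached_sota("swe-bench")
print(record["metric"], record["direction"], record["leader"]["model"])
```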
Trained something that beats the table?
Submit a checkpoint, paper result, or correction with structured benchmark provenance. We validate the score, cross-check the source, and add the row to the registry with its date and evidence trail.
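The submission schema itself is not shown on this page, so the payload below is only a guess at what structured benchmark provenance could include, assembled from the fields registry rows already carry; every field name here is hypothetical.

```python
# Hypothetical illustration only: the real submission schema is not documented
# on this page. Fields are guessed from what registry rows already carry
# (benchmark, metric, direction, score, source tier, snapshot date, links).
submission = {
    "benchmark": "swe-bench",
    "model": "my-model-v1",
    "metric": "resolve rate",
    "direction": "higher",
    "score": 88.1,
    "source": "paper",  # paper | vendor | reproduced
    "source_url": "https://example.org/my-paper",
    "access_date": "2026-04-27",
}
```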