Pick a task.
Choose the best model.
CodeSOTA starts from the job you need done, then maps it to tasks, benchmarks, model pages, papers, and dated sources. Researchers find the frontier, technical leaders choose architectures, developers integrate current picks, and everyone else gets a readable path through the evidence.
Use search when you know a model, benchmark, paper, dataset, or short problem description. Use the modality map when you want to browse from a capability area.
Describe the task. Get the model decision.
Query models, tasks, benchmarks, papers, and datasets in one place. The result should return a decision path: what to use, why it wins, what evidence backs it, and which snapshot your team can cite.
Start from the job, not a generic leaderboard.
Inspect benchmarks, papers, source tiers, and dates.
Use the API or cite the snapshot in your workflow.
One representative result per capability area.
A compact snapshot of the nine capability areas. The homepage only prints a model when the registry row is verified and has an inspectable source URL; otherwise the row stays pending instead of promoting a stale or weak claim.
- Results: 9,102
- Models tracked: 163
- Datasets indexed: 371
- Capability areas: 9
| Capability | Evidence | Trusted model | Metric | Score | Source | Snapshot |
|---|---|---|---|---|---|---|
| Language & Knowledge | MMLU-Pro | Pending audit | accuracy | Pending | pending source | 2026-04-27 |
| Vision & Documents | OCRBench v2 | ovis2-5-9b | overall | 63.40 | paperswithcode-public-api | 2026-05-18 |
| Audio & Speech | WildASR | Pending audit | WER (lower) | Pending | pending source | 2026-04-27 |
| Multimodal Media | VQA-v2 | Pending audit | accuracy | Pending | pending source | 2026-04-27 |
| Code & Software Engineering | SWE-bench Verified | Claude Opus 4.7 | resolve rate | 87.6% | vendor | 2026-04-23 |
| Agents & Tool Use | GAIA | Pending audit | accuracy | Pending | pending source | 2026-04-27 |
| Structured Data & Forecasting | MTEB | Pending audit | avg | Pending | pending source | 2026-04-27 |
| Robotics, Control & RL | Atari 2600 | Pending audit | human-normalized score | Pending | pending source | 2026-04-27 |
| Science, Medicine & Industry | MVTec-AD | Pending audit | score | Pending | pending source | 2026-04-27 |
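The display rule behind this table (print a model only when the row is verified and its source can be inspected, otherwise keep it pending) can be sketched as follows. This is a minimal illustration, not the site's implementation: `RegistryRow`, its field names, and the "absolute http(s) URL" test for an inspectable source are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class RegistryRow:
    # Hypothetical fields; the real registry schema is not shown on this page.
    benchmark: str
    model: Optional[str]
    source_url: Optional[str]
    verified: bool


def has_inspectable_source(url: Optional[str]) -> bool:
    """Assumption: a source is inspectable if it is an absolute http(s) URL."""
    if not url:
        return False
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def display_model(row: RegistryRow) -> str:
    """Return the model name to print, or 'Pending audit' if the row fails the rule."""
    if row.verified and row.model and has_inspectable_source(row.source_url):
        return row.model
    return "Pending audit"
```

Under these assumptions, a verified row with a reachable source prints its model name, while an unverified row or one without a source URL falls back to "Pending audit" rather than showing a stale claim.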
curl https://www.codesota.com/api/sota/swe-bench
Practical routes. Benchmarks as evidence.
The top-level map is a navigation layer, not a perfect ontology. Capabilities, modalities, and vertical domains stay linked through task pages, benchmark sets, datasets, models, papers, and evidence rows.
Language & Knowledge: Reasoning, exams, retrieval, and knowledge-heavy language tasks.
Vision & Documents: Images, detection, OCR, layout, tables, and document parsing.
Audio & Speech: ASR, audio tagging, voice assistants, speech quality, and TTS.
Multimodal Media: VQA, charts, video, image-text reasoning, and media understanding.
Code & Software Engineering: Code generation, repair, repository tasks, and verified software work.
Agents & Tool Use: Long-horizon tool use, browser work, OS tasks, and workflow execution.
Structured Data & Forecasting: Embeddings, retrieval, reranking, tabular prediction, graphs, and forecasting.
Robotics, Control & RL: Simulation, control, games, embodied agents, and manipulation.
Science, Medicine & Industry: Scientific QA, medical imaging, industrial inspection, and applied AI.
A leaderboard row is not a fact until it can be inspected.
CodeSOTA is useful only if the evidence is visible. The homepage now surfaces the provenance contract before editorial lineages and release notes.
Dated scores
Rows carry access dates and snapshot context so old frontier claims do not masquerade as current facts.
Metric direction
Every benchmark declares whether higher or lower is better before a winner is selected.
Source tiers
Paper, vendor, reproduced, and registry-maintained rows are labeled separately.
Provenance trail
Benchmark pages connect model, paper, dataset, code, and source URL where available.
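The first two rules of this contract (dated scores and a declared metric direction) can be combined in a small winner-selection sketch. The `ScoreRow` type, its field names, and the `max_age_days` cutoff are assumptions made for illustration; the page does not say how the registry defines staleness.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ScoreRow:
    # Hypothetical row shape for illustration only.
    model: str
    score: float
    source_tier: str  # e.g. "paper", "vendor", "reproduced"
    accessed: date    # date the source was last checked


def pick_leader(
    rows: List[ScoreRow],
    direction: str,
    as_of: date,
    max_age_days: int = 365,  # assumed staleness cutoff
) -> Optional[ScoreRow]:
    """Pick the best fresh row, honoring the declared metric direction."""
    if direction not in ("higher", "lower"):
        raise ValueError("direction must be 'higher' or 'lower'")
    # Drop rows whose access date is older than the cutoff.
    fresh = [r for r in rows if (as_of - r.accessed).days <= max_age_days]
    if not fresh:
        return None
    chooser = max if direction == "higher" else min
    return chooser(fresh, key=lambda r: r.score)
```

Requiring `direction` as an explicit argument mirrors the rule that a benchmark declares it before any winner is chosen: for a lower-is-better metric such as WER, the same rows yield a different leader than they would under `"higher"`.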
Prove your model wins where buyers actually care.
Synthetic demos are not enough. CodeSOTA builds vendor-grade benchmark packages: private task suites, agent workflows, baseline comparisons, cost accounting, failure analysis, and a publishable evidence trail when the result is ready.
The registry is callable.
Agents and notebooks should not scrape leaderboards. They should call a stable, source-aware endpoint and cache the snapshot they used.
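A minimal sketch of that call-and-cache pattern, using only the Python standard library. The endpoint path comes from the curl examples on this page; the response is treated as an opaque JSON object, and the function names, the one-file-per-task cache layout, and the injectable `fetch` parameter are assumptions made for illustration.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable

BASE_URL = "https://www.codesota.com/api/sota"


def fetch_json(task: str) -> dict:
    """Fetch one task record over HTTP; the response shape is not guaranteed here."""
    with urllib.request.urlopen(f"{BASE_URL}/{task}", timeout=10) as resp:
        return json.load(resp)


def lookup(task: str, cache_dir: Path, fetch: Callable[[str], dict] = fetch_json) -> dict:
    """Return the cached record for a task, fetching and storing it on first use."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{task}.json"
    if cache_file.exists():
        # Reuse the exact snapshot that earlier runs already relied on.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    record = fetch(task)
    cache_file.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record
```

Calling `lookup("swe-bench", Path(".sota_cache"))` would hit the network once and read from disk afterwards, so a notebook keeps citing the same snapshot until the cache file is deleted.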
curl https://www.codesota.com/api/sota/swe-bench
curl "https://www.codesota.com/api/sota?area=vision-documents"
{
  "task": "swe-bench",
  "metric": "resolve rate",
  "direction": "higher",
  "leader": {
    "model": "registry top pick",
    "score": "dated value",
    "source": "paper | vendor | reproduced",
    "snapshot_id": "2026-04-27"
  }
}
Trained something that beats the table?
Submit a checkpoint, paper result, or correction with structured benchmark provenance. We validate the score, cross-check the source, and add the row to the registry with its date and evidence trail.
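What "structured benchmark provenance" might look like on the submitter's side can be sketched as a small record with a pre-submission check. Everything here is hypothetical: the `Submission` fields, the allowed tiers, and the validation rules are illustrative guesses, not CodeSOTA's actual submission format.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Assumed set of tiers a submitter could claim, taken from the labels used on this page.
ALLOWED_TIERS = {"paper", "vendor", "reproduced"}


@dataclass
class Submission:
    # Hypothetical submission shape for illustration only.
    model: str
    benchmark: str
    metric: str
    score: float
    source_tier: str
    source_url: str
    result_date: date


def validation_errors(sub: Submission) -> List[str]:
    """Collect human-readable problems; an empty list means the record looks complete."""
    errors = []
    if not sub.model.strip():
        errors.append("model is required")
    if not sub.benchmark.strip():
        errors.append("benchmark is required")
    if sub.source_tier not in ALLOWED_TIERS:
        errors.append("source_tier must be one of: " + ", ".join(sorted(ALLOWED_TIERS)))
    if not sub.source_url.startswith(("http://", "https://")):
        errors.append("source_url must be an http(s) URL")
    return errors
```

Running such a check locally before submitting surfaces missing provenance early, so the row arrives with the date and source that the registry needs to validate it.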