The data terminal for RL environments & SOTA-with-code.
One aggregated, dated, source-tiered registry of the evals, RL environments, models, and papers that move the frontier — the place AI labs check to see which environments actually separate models, and the team that builds the verifiable-reward environments that do.
Three ways in. All backed by the same evidence.
Whether you are choosing what to cite, deciding what to train on, or tracking the frontier, you land on the same dated, source-tiered registry underneath.
Browse benchmarks by what they prove.
Every eval with status, saturation, and lift evidence — so you cite numbers that still separate models, not dead ones.
Get the environments that lift it.
Pick a capability gap; get RL environments and datasets ranked by discriminative power — the ones most likely to move a strong model.
The frontier feed.
Latest evals, environments, models, and papers — chronological, dated, and linked back to the registry rows they touch.
One representative result per capability area.
A compact snapshot of the nine capability areas. The homepage only prints a model when the registry row is verified and has an inspectable source URL; otherwise the row stays pending instead of promoting a stale or weak claim.
- Results: 9,102
- Models tracked: 163
- Datasets indexed: 371
- Capability areas: 9
| Capability | Evidence | Trusted model | Metric | Score | Source | Snapshot |
|---|---|---|---|---|---|---|
| Language & Knowledge | MMLU-Pro | Pending audit | accuracy | Pending | pending source | 2026-04-27 |
| Vision & Documents | OCRBench v2 | ovis2-5-9b | overall | 63.40 | paperswithcode-public-api | 2026-05-18 |
| Audio & Speech | WildASR | Pending audit | WER (lower) | Pending | pending source | 2026-04-27 |
| Multimodal Media | VQA-v2 | Pending audit | accuracy | Pending | pending source | 2026-04-27 |
| Code & Software Engineering | SWE-bench Verified | Claude Opus 4.7 | resolve rate | 87.6% | vendor | 2026-04-23 |
| Agents & Tool Use | GAIA | Pending audit | accuracy | Pending | pending source | 2026-04-27 |
| Structured Data & Forecasting | MTEB | Pending audit | avg | Pending | pending source | 2026-04-27 |
| Robotics, Control & RL | Atari 2600 | Pending audit | human-normalized score | Pending | pending source | 2026-04-27 |
| Science, Medicine & Industry | MVTec-AD | Pending audit | score | Pending | pending source | 2026-04-27 |
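The publishing rule behind this table (print a model only when the row is verified and has an inspectable source URL, otherwise keep it pending) can be sketched as follows. This is a minimal illustration, not the registry's implementation: the `RegistryRow` fields, the `trusted_model_cell` helper, and the example rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegistryRow:
    # Hypothetical row shape; these field names are assumptions, not the registry schema.
    benchmark: str
    model: Optional[str]
    verified: bool
    source_url: Optional[str]


def trusted_model_cell(row: RegistryRow) -> str:
    """Return the model name only for verified rows with a non-empty source URL."""
    has_source = bool(row.source_url and row.source_url.strip())
    if row.verified and row.model and has_source:
        return row.model
    # Anything else stays pending instead of promoting a stale or weak claim.
    return "Pending audit"


# Illustrative rows only; none of these are real registry entries.
rows = [
    RegistryRow("Benchmark A", "model-x", verified=True, source_url="https://example.org/result"),
    RegistryRow("Benchmark B", "model-y", verified=False, source_url="https://example.org/other"),
    RegistryRow("Benchmark C", "model-z", verified=True, source_url=None),
]
cells = [trusted_model_cell(r) for r in rows]
```

Only the first row satisfies both conditions, so it is the only one whose model name would be printed.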
curl https://www.codesota.com/api/sota/swe-bench
Cite the registry.
Practical routes. Benchmarks as evidence.
The top-level map is a navigation layer, not a perfect ontology. Capabilities, modalities, and vertical domains stay linked through task pages, benchmark sets, datasets, models, papers, and evidence rows.
- Language & Knowledge: Reasoning, exams, retrieval, and knowledge-heavy language tasks.
- Vision & Documents: Images, detection, OCR, layout, tables, and document parsing.
- Audio & Speech: ASR, audio tagging, voice assistants, speech quality, and TTS.
- Multimodal Media: VQA, charts, video, image-text reasoning, and media understanding.
- Code & Software Engineering: Code generation, repair, repository tasks, and verified software work.
- Agents & Tool Use: Long-horizon tool use, browser work, OS tasks, and workflow execution.
- Structured Data & Forecasting: Embeddings, retrieval, reranking, tabular prediction, graphs, and forecasting.
- Robotics, Control & RL: Simulation, control, games, embodied agents, and manipulation.
- Science, Medicine & Industry: Scientific QA, medical imaging, industrial inspection, and applied AI.
A leaderboard row is not a fact until it can be inspected.
CodeSOTA is useful only if the evidence is visible. The homepage now surfaces the provenance contract before editorial lineages and release notes.
Dated scores
Rows carry access dates and snapshot context so old frontier claims do not masquerade as current facts.
Metric direction
Every benchmark declares whether higher or lower is better before a winner is selected.
Source tiers
Paper, vendor, reproduced, and registry-maintained rows are labeled separately.
Provenance trail
Benchmark pages connect model, paper, dataset, code, and source URL where available.
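To make the contract concrete, here is a minimal sketch of choosing a leader only after the metric direction is declared, with source tiers kept apart and the access date carried on every row. The `EvidenceRow` fields, the `leaders_by_tier` helper, and all scores below are hypothetical and purely illustrative.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class EvidenceRow:
    # Hypothetical row shape; not the registry's actual schema.
    model: str
    score: float
    source_tier: str  # e.g. "paper", "vendor", "reproduced", "registry-maintained"
    accessed: date    # access date travels with the score


def leaders_by_tier(rows, direction):
    """Pick the best row per source tier, honoring the declared metric direction."""
    if direction not in ("higher", "lower"):
        raise ValueError("direction must be 'higher' or 'lower'")
    better = max if direction == "higher" else min
    by_tier = {}
    for row in rows:
        by_tier.setdefault(row.source_tier, []).append(row)
    # Tiers are never merged: each tier gets its own leader.
    return {tier: better(group, key=lambda r: r.score) for tier, group in by_tier.items()}


# Illustrative word-error-rate style rows (lower is better); the numbers are made up.
rows = [
    EvidenceRow("model-a", 7.2, "paper", date(2026, 1, 10)),
    EvidenceRow("model-b", 5.9, "vendor", date(2026, 3, 2)),
    EvidenceRow("model-c", 6.4, "paper", date(2026, 2, 14)),
]
wer_leaders = leaders_by_tier(rows, "lower")
```

With `direction="lower"` the paper-tier leader is `model-c`; flipping the direction to `"higher"` would select `model-a` instead, which is why the direction has to be fixed before any winner is named.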
The neutral measure for the environments labs train on.
A market of RL-environment startups is selling to frontier labs, and the labs keep asking the same question: does this environment actually separate models, or is it saturated? CodeSOTA is the independent party that answers it. If you build environments, we certify that yours discriminates between models. If you train models, we tell you which environments are worth the run.
The registry is callable.
Agents and notebooks should not scrape leaderboards. They should call a stable, source-aware endpoint and cache the snapshot they used.
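One way a notebook might cache the snapshot it used is sketched below. It assumes the response has the shape of the sample payload shown in this section (`task`, `leader.snapshot_id`); the `cache_snapshot` and `load_snapshot` helpers and the file-naming scheme are hypothetical, and the HTTP call itself is left out so the sketch runs offline.

```python
import json
import tempfile
from pathlib import Path


def cache_snapshot(payload: dict, cache_dir: Path) -> Path:
    """Store a payload under a name derived from its task and snapshot_id."""
    task = payload["task"]
    snapshot_id = payload["leader"]["snapshot_id"]
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{task}__{snapshot_id}.json"
    if not path.exists():  # never overwrite a snapshot that was already cited
        path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return path


def load_snapshot(task: str, snapshot_id: str, cache_dir: Path) -> dict:
    """Read back exactly the snapshot that an analysis was based on."""
    path = cache_dir / f"{task}__{snapshot_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))


# Sample payload copied from the response shape on this page (placeholder values).
sample = {
    "task": "swe-bench",
    "metric": "resolve rate",
    "direction": "higher",
    "leader": {
        "model": "registry top pick",
        "score": "dated value",
        "source": "paper | vendor | reproduced",
        "snapshot_id": "2026-04-27",
    },
}
cache_dir = Path(tempfile.mkdtemp())
saved = cache_snapshot(sample, cache_dir)
restored = load_snapshot("swe-bench", "2026-04-27", cache_dir)
```

Keying the file on `snapshot_id` means a later registry update produces a new file rather than silently changing the numbers behind an earlier result.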
curl https://www.codesota.com/api/sota/swe-bench
curl "https://www.codesota.com/api/sota?area=vision-documents"
{
  "task": "swe-bench",
  "metric": "resolve rate",
  "direction": "higher",
  "leader": {
    "model": "registry top pick",
    "score": "dated value",
    "source": "paper | vendor | reproduced",
    "snapshot_id": "2026-04-27"
  }
}
Trained something that beats the table?
Submit a checkpoint, paper result, or correction with structured benchmark provenance. We validate the score, cross-check the source, and add the row to the registry with its date and evidence trail.