The source of record for RL environments & SOTA · snapshot 2026-04-27

The data terminal for RL environments & SOTA-with-code.

One aggregated, dated, source-tiered registry of the evals, RL environments, models, and papers that move the frontier. It is where AI labs check which environments actually separate models, maintained by the team that builds verifiable-reward environments that do.

163 models · 371 benchmarks · 9,102 results · 32 RL envs · 9 capabilities · 121 tasks
§ 02 · Current frontier

One representative result per capability area.

A compact snapshot of the nine capability areas. The homepage only prints a model when the registry row is verified and has an inspectable source URL; otherwise the row stays pending instead of promoting a stale or weak claim.


Results 9,102 · Models tracked 163 · Datasets indexed 371 · Capability areas 9
Live registry snapshot: 2026-04-27
Representative evidence · inspect the task page

Capability | Evidence | Trusted model | Metric | Score | Source | Snapshot
Language & Knowledge | MMLU-Pro | Pending audit | accuracy | Pending | pending source | 2026-04-27
Vision & Documents | OCRBench v2 | ovis2-5-9b | overall | 63.40 | paperswithcode-public-api | 2026-05-18
Audio & Speech | WildASR | Pending audit | WER (lower) | Pending | pending source | 2026-04-27
Multimodal Media | VQA-v2 | Pending audit | accuracy | Pending | pending source | 2026-04-27
Code & Software Engineering | SWE-bench Verified | Claude Opus 4.7 | resolve rate | 87.6% | vendor | 2026-04-23
Agents & Tool Use | GAIA | Pending audit | accuracy | Pending | pending source | 2026-04-27
Structured Data & Forecasting | MTEB | Pending audit | avg | Pending | pending source | 2026-04-27
Robotics, Control & RL | Atari 2600 | Pending audit | human-normalized score | Pending | pending source | 2026-04-27
Science, Medicine & Industry | MVTec-AD | Pending audit | score | Pending | pending source | 2026-04-27
Fig 2 · Each row shows one representative benchmark for orientation, not proof that a whole capability has one canonical test. Unverified rows, missing sources, and malformed source links are deliberately withheld from the first-page model claim. Scores are drawn from the open JSON at /data/benchmarks.json.
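
The withholding rule described above can be sketched in a few lines of Python. This is an illustration only, not the registry's actual code: the field names `verified`, `source_url`, and `model`, the helper names, and the example rows are all assumptions, and "malformed" is approximated here as "does not parse as an http(s) URL with a host".

```python
from urllib.parse import urlparse


def is_inspectable_url(url):
    """Approximate 'not malformed': an http(s) URL that has a host part."""
    if not url:
        return False
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def homepage_cell(row):
    """Return the model name to print, or 'Pending audit' if the row is withheld."""
    if row.get("verified") and is_inspectable_url(row.get("source_url")):
        return row["model"]
    return "Pending audit"


# Hypothetical rows; none of these come from the registry.
rows = [
    {"model": "model-a", "verified": True, "source_url": "https://example.org/report"},
    {"model": "model-b", "verified": True, "source_url": "not a url"},
    {"model": "model-c", "verified": False, "source_url": "https://example.org/paper"},
    {"model": "model-d", "verified": True, "source_url": None},
]
cells = [homepage_cell(r) for r in rows]
print(cells)
```

Only the first row passes both checks; the other three stay pending because of a malformed link, a missing verification flag, and a missing source respectively.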
API mirror
curl https://www.codesota.com/api/sota/swe-bench
Docs
Cited & referenced by
Researchers and analysts cite the registry.
Univ. of Surrey · AAAI 2026
Tomasz Tunguz · Theory Ventures
UseAIAPI
AlternativeTo
Hacker News
r/MachineLearning
See all citations →
§ 03 · Capability map

Practical routes. Benchmarks as evidence.

The top-level map is a navigation layer, not a perfect ontology. Capabilities, modalities, and vertical domains stay linked through task pages, benchmark sets, datasets, models, papers, and evidence rows.

Fig 4 · Each tile is a route into the registry. Detailed pages can still separate capability, modality, domain, benchmark role, and trust flags without forcing all of that into the homepage.
§ 04 · Trust layer

A leaderboard row is not a fact until it can be inspected.

CodeSOTA is useful only if the evidence is visible. The homepage now surfaces the provenance contract before editorial lineages and release notes.

01 · Dated scores

Rows carry access dates and snapshot context so old frontier claims do not masquerade as current facts.

02 · Metric direction

Every benchmark declares whether higher or lower is better before a winner is selected.

03 · Source tiers

Paper, vendor, reproduced, and registry-maintained rows are labeled separately.

04 · Provenance trail

Benchmark pages connect model, paper, dataset, code, and source URL where available.
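
As a rough illustration of how the dated-score and metric-direction rules above might interact, the Python sketch below picks a leader only after the direction is declared and lets a caller drop rows whose access date is too old. Everything here is assumed for the example: the row fields (`score`, `source`, `accessed`), the tier spellings, the 365-day cutoff, and the sample rows are hypothetical and not taken from the registry.

```python
from datetime import date

# Tier labels mirror the four tiers named above; the exact spellings are assumed.
SOURCE_TIERS = {"paper", "vendor", "reproduced", "registry-maintained"}


def pick_leader(rows, direction):
    """Select the best row for a metric whose direction is 'higher' or 'lower'."""
    if direction not in ("higher", "lower"):
        raise ValueError("direction must be declared as 'higher' or 'lower'")
    if not rows:
        return None
    chooser = max if direction == "higher" else min
    return chooser(rows, key=lambda r: r["score"])


def is_stale(row, snapshot, max_age_days=365):
    """True when the row was accessed more than max_age_days before the snapshot."""
    age = date.fromisoformat(snapshot) - date.fromisoformat(row["accessed"])
    return age.days > max_age_days


# Hypothetical word-error-rate rows (lower is better).
wer_rows = [
    {"model": "model-a", "score": 7.2, "source": "paper", "accessed": "2026-03-01"},
    {"model": "model-b", "score": 5.9, "source": "vendor", "accessed": "2024-11-15"},
    {"model": "model-c", "score": 6.4, "source": "reproduced", "accessed": "2026-04-20"},
]
assert all(r["source"] in SOURCE_TIERS for r in wer_rows)

leader_any_age = pick_leader(wer_rows, "lower")
fresh_rows = [r for r in wer_rows if not is_stale(r, "2026-04-27")]
leader_fresh = pick_leader(fresh_rows, "lower")
print(leader_any_age["model"], leader_fresh["model"])
```

With these made-up rows, the lowest error rate belongs to a row accessed well over a year before the snapshot, so filtering on access date changes the leader. That is the failure mode the dated-scores rule is meant to prevent.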

§ 04b · The RL-environment market

The neutral measure for the environments labs train on.

A market of RL-environment startups is selling to frontier labs, and the labs keep asking the same question: does this environment actually separate models, or is it saturated? CodeSOTA is the independent party that answers it. If you build environments, we certify that yours discriminates between models. If you train models, we tell you which environments are worth the run.

Neutral third party · Same lens on every env · Reproducible scoring · Lab-grade evidence
§ 05 · API

The registry is callable.

Agents and notebooks should not scrape leaderboards. They should call a stable, source-aware endpoint and cache the snapshot they used.

Public JSON · Snapshot IDs · CORS-open · Task-first
API docs
codesota/sota-api
curl "https://www.codesota.com/api/sota/swe-bench"
curl "https://www.codesota.com/api/sota?area=vision-documents"
{
  "task": "swe-bench",
  "metric": "resolve rate",
  "direction": "higher",
  "leader": {
    "model": "registry top pick",
    "score": "dated value",
    "source": "paper | vendor | reproduced",
    "snapshot_id": "2026-04-27"
  }
}
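
One way a notebook or agent could follow the "call the endpoint, cache the snapshot you used" advice is sketched below in Python. It assumes the endpoint returns JSON shaped like the sample response above, with `task` and `leader.snapshot_id` present; the helper names and the on-disk layout (`<cache_dir>/<task>/<snapshot_id>.json`) are choices made for this example, not part of the API. The demo at the bottom uses the sample payload and a temporary directory, so it makes no network call.

```python
import json
import tempfile
import urllib.request
from pathlib import Path

API_BASE = "https://www.codesota.com/api/sota"


def fetch_sota(task, timeout=10.0):
    """Fetch one task record. Assumes a JSON body shaped like the sample above."""
    with urllib.request.urlopen(f"{API_BASE}/{task}", timeout=timeout) as resp:
        return json.load(resp)


def cache_snapshot(payload, cache_dir):
    """Write the payload once per (task, snapshot_id) and return the file path."""
    snapshot_id = payload["leader"]["snapshot_id"]
    path = Path(cache_dir) / payload["task"] / f"{snapshot_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():  # never overwrite a snapshot that was already recorded
        path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return path


def load_snapshot(task, snapshot_id, cache_dir):
    """Read back exactly the snapshot an earlier run used."""
    path = Path(cache_dir) / task / f"{snapshot_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))


# Offline demo with the placeholder payload from the sample response.
sample = {
    "task": "swe-bench",
    "metric": "resolve rate",
    "direction": "higher",
    "leader": {
        "model": "registry top pick",
        "score": "dated value",
        "source": "paper | vendor | reproduced",
        "snapshot_id": "2026-04-27",
    },
}
with tempfile.TemporaryDirectory() as tmp:
    saved = cache_snapshot(sample, tmp)
    relative = saved.relative_to(tmp).as_posix()
    restored = load_snapshot("swe-bench", "2026-04-27", tmp)
print(relative)
```

In real use the payload would come from `fetch_sota("swe-bench")`. Keying the cache on the snapshot ID means a later analysis can cite the exact registry state it was computed from.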
§ 10 · Register

Trained something that beats the table?

Submit a checkpoint, paper result, or correction with structured benchmark provenance. We validate the score, cross-check the source, and add the row to the registry with its date and evidence trail.