Recent study: Blind TTS Elo is live. Compare two anonymous voice samples, vote after listening, and help separate real preference signal from noise. Vote in the study ->
Open JSON registry · Source-tiered rows · Snapshot dates · Correction workflow
121 tasks by modality · 9,102 benchmark results

Pick a task.
Choose the best model.

CodeSOTA starts from the job you need done, then maps it to tasks, benchmarks, model pages, papers, and dated sources. Researchers find the frontier, technical leaders choose architectures, developers integrate current picks, and everyone else gets a readable path through the evidence.

Use search when you know a model, benchmark, paper, dataset, or short problem description. Use the modality map when you want to browse from a capability area.

Benchmark rows: 9,102 (dated registry entries)
Models tracked: 163 (cross-task index)
Datasets: 371 (benchmark sources)
Capability areas: 9 (coverage map)
§ 01 · Registry search

Describe the task. Get the model decision.

Query models, tasks, benchmarks, papers, and datasets in one place. Each result should lay out a decision path: what to use, why it wins, what evidence backs it, and which snapshot your team can cite.

01 · Describe task

Start from the job, not a generic leaderboard.

02 · Trace evidence

Inspect benchmarks, papers, source tiers, and dates.

03 · Ship decision

Use the API or cite the snapshot in your workflow.

§ 02 · Current frontier

One representative result per capability area.

A compact snapshot of the nine capability areas. The homepage prints a model only when the registry row is verified and has an inspectable source URL; otherwise the row stays pending rather than promoting a stale or weak claim.


Results: 9,102 · Models tracked: 163 · Datasets indexed: 371 · Capability areas: 9
API schema →
Live registry snapshot: 2026-04-27 · Use API →
Representative evidence · inspect the task page
| Capability | Evidence | Trusted model | Metric | Score | Source | Snapshot |
|---|---|---|---|---|---|---|
| Language & Knowledge | MMLU-Pro | Pending audit | accuracy | Pending | pending source | 2026-04-27 |
| Vision & Documents | OCRBench v2 | ovis2-5-9b | overall | 63.40 | paperswithcode-public-api | 2026-05-18 |
| Audio & Speech | WildASR | Pending audit | WER (lower) | Pending | pending source | 2026-04-27 |
| Multimodal Media | VQA-v2 | Pending audit | accuracy | Pending | pending source | 2026-04-27 |
| Code & Software Engineering | SWE-bench Verified | Claude Opus 4.7 | resolve rate | 87.6% | vendor | 2026-04-23 |
| Agents & Tool Use | GAIA | Pending audit | accuracy | Pending | pending source | 2026-04-27 |
| Structured Data & Forecasting | MTEB | Pending audit | avg | Pending | pending source | 2026-04-27 |
| Robotics, Control & RL | Atari 2600 | Pending audit | human-normalized score | Pending | pending source | 2026-04-27 |
| Science, Medicine & Industry | MVTec-AD | Pending audit | score | Pending | pending source | 2026-04-27 |
Fig 2 · Each row shows one representative benchmark for orientation, not proof that a whole capability has one canonical test. Unverified rows, missing sources, and malformed source links are deliberately withheld from the first-page model claim. Scores are drawn from the open JSON at /data/benchmarks.json.
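The withholding rule described for this table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the site's actual code: the `RegistryRow` fields, the `homepage_cell` helper, and the definition of "inspectable" (an http or https URL with a host) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class RegistryRow:
    # Hypothetical field names; the real registry schema may differ.
    benchmark: str
    model: Optional[str]
    score: Optional[str]
    verified: bool
    source_url: Optional[str]


def has_inspectable_source(url: Optional[str]) -> bool:
    """Assumption: a source is inspectable if it parses as an http(s) URL with a host."""
    if not url:
        return False
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def homepage_cell(row: RegistryRow) -> tuple[str, str]:
    """Return (model, score) for display, or the pending placeholders."""
    if row.verified and row.model and row.score and has_inspectable_source(row.source_url):
        return row.model, row.score
    # Unverified rows, missing sources, and malformed links are all withheld.
    return "Pending audit", "Pending"
```

Under this sketch, a row with a real score but a malformed link falls back to "Pending audit" in the same way as a row with no source at all.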
API mirror
curl https://www.codesota.com/api/sota/swe-bench
Docs
§ 03 · Capability map

Practical routes. Benchmarks as evidence.

The top-level map is a navigation layer, not a perfect ontology. Capabilities, modalities, and vertical domains stay linked through task pages, benchmark sets, datasets, models, papers, and evidence rows.

Fig 4 · Each tile is a route into the registry. Detailed pages can still separate capability, modality, domain, benchmark role, and trust flags without forcing all of that into the homepage.
§ 04 · Trust layer

A leaderboard row is not a fact until it can be inspected.

CodeSOTA is useful only if the evidence is visible. The homepage now surfaces the provenance contract before editorial lineages and release notes.

01

Dated scores

Rows carry access dates and snapshot context so old frontier claims do not masquerade as current facts.

02

Metric direction

Every benchmark declares whether higher or lower is better before a winner is selected.

03

Source tiers

Paper, vendor, reproduced, and registry-maintained rows are labeled separately.

04

Provenance trail

Benchmark pages connect model, paper, dataset, code, and source URL where available.
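
To make the metric-direction point concrete, here is a minimal sketch of selecting a leader only after the direction is declared. The `pick_leader` function and the row layout are hypothetical, and the sample rows are made-up values, not registry data.

```python
from typing import Iterable, Mapping


def pick_leader(rows: Iterable[Mapping], direction: str) -> Mapping:
    """Pick the best-scoring row for a benchmark, given its declared direction."""
    if direction not in ("higher", "lower"):
        raise ValueError(f"direction must be 'higher' or 'lower', got {direction!r}")
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to rank")
    # WER-style metrics want the minimum; accuracy-style metrics want the maximum.
    best = max if direction == "higher" else min
    return best(rows, key=lambda r: r["score"])


# Illustrative, made-up rows (not registry data).
wer_rows = [
    {"model": "model-a", "score": 7.2, "source": "paper"},
    {"model": "model-b", "score": 5.9, "source": "vendor"},
]
print(pick_leader(wer_rows, "lower")["model"])  # model-b
```

Refusing to rank when the direction is missing or misspelled keeps a lower-is-better metric such as WER from being sorted as if higher were better.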

§ 04b · Vendor evals

Prove your model wins where buyers actually care.

Synthetic demos are not enough. CodeSOTA builds vendor-grade benchmark packages: private task suites, agent workflows, baseline comparisons, cost accounting, failure analysis, and a publishable evidence trail when the result is ready.

Real task data · Source-backed claims · Reproducible runs · Procurement-ready
§ 05 · API

The registry is callable.

Agents and notebooks should not scrape leaderboards. They should call a stable, source-aware endpoint and cache the snapshot they used.

Public JSON · Snapshot IDs · CORS-open · Task-first
API docs
codesota/sota-api
curl https://www.codesota.com/api/sota/swe-bench
curl "https://www.codesota.com/api/sota?area=vision-documents"
{
  "task": "swe-bench",
  "metric": "resolve rate",
  "direction": "higher",
  "leader": {
    "model": "registry top pick",
    "score": "dated value",
    "source": "paper | vendor | reproduced",
    "snapshot_id": "2026-04-27"
  }
}
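
One way a notebook or agent might call the endpoint and cache the snapshot it used is sketched below. It assumes the response carries the snapshot ID at `leader.snapshot_id`, as in the sample response above; the `SnapshotCache` class and `fetch_json` helper are hypothetical and not part of any published client.

```python
import json
import urllib.request
from typing import Callable, Dict, Tuple


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


class SnapshotCache:
    """Store each response under (task, snapshot_id) so earlier snapshots stay citable."""

    def __init__(self, fetch: Callable[[str], dict] = fetch_json,
                 base_url: str = "https://www.codesota.com/api/sota"):
        self._fetch = fetch
        self._base_url = base_url
        self._store: Dict[Tuple[str, str], dict] = {}

    def refresh(self, task: str) -> str:
        """Fetch the current leader for a task and return the snapshot ID it came from."""
        payload = self._fetch(f"{self._base_url}/{task}")
        # Assumption: the snapshot ID lives at leader.snapshot_id.
        snapshot_id = payload["leader"]["snapshot_id"]
        # Keep the first copy seen for a snapshot; later fetches do not overwrite it.
        self._store.setdefault((task, snapshot_id), payload)
        return snapshot_id

    def get(self, task: str, snapshot_id: str) -> dict:
        """Return the cached payload for a specific snapshot of a task."""
        return self._store[(task, snapshot_id)]
```

Keying the cache on the snapshot ID rather than on the task alone means a later registry update does not silently change a result that was already cited.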
§ 10 · Register

Trained something that beats the table?

Submit a checkpoint, paper result, or correction with structured benchmark provenance. We validate the score, cross-check the source, and add the row to the registry with its date and evidence trail.
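
As a rough illustration of what "structured benchmark provenance" could look like before submission, the sketch below checks a record against the trust-layer items listed earlier (metric direction, source tier, source URL, access date). The `Submission` fields and the `validation_errors` helper are hypothetical; the actual submission format is not specified here.

```python
from dataclasses import dataclass
from datetime import date

# Source tiers as named in the trust layer section.
SOURCE_TIERS = {"paper", "vendor", "reproduced", "registry-maintained"}


@dataclass
class Submission:
    # Hypothetical field names for illustration only.
    benchmark: str
    model: str
    metric: str
    direction: str
    score: float
    source_tier: str
    source_url: str
    accessed: date


def validation_errors(s: Submission) -> list[str]:
    """Return a list of problems; an empty list means the record passes these checks."""
    errors = []
    if s.direction not in ("higher", "lower"):
        errors.append("direction must be 'higher' or 'lower'")
    if s.source_tier not in SOURCE_TIERS:
        errors.append(f"unknown source tier: {s.source_tier}")
    if not s.source_url.startswith(("http://", "https://")):
        errors.append("source_url must be an http(s) URL")
    if s.accessed > date.today():
        errors.append("accessed date is in the future")
    return errors
```

These checks cover only the shape of the record; validating the score itself and cross-checking the source remain manual steps.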