CodeSOTA · The Open Registry
Human-readable pages · machine-readable SOTA API
Issue: April 27, 2026
9 capability areas · 121 tasks · 9,102 benchmark results

Find the current best AI model for a task
with evidence attached.

Benchmark, metric, source, date, and provenance attached. CodeSOTA maps models, papers, datasets, code, and scores into one dated, inspectable registry. Search it like a research index, or call /api/sota from an agent, notebook, or dashboard.

No paywall for reading or API access. Signup is optional for editing results, submitting evidence, and giving feedback. No sponsored leaderboards. Each result carries a source, snapshot date, metric direction, and provenance trail, so a benchmark claim can be inspected before it is reused.

Browse tasks · Try /api/sota · Submit score · Open JSON · snapshot 2026-04-27
§ 01 · Registry search

Search the registry before you trust a leaderboard.

Query models, tasks, benchmarks, papers, and datasets in one place. The answer should point to typed registry objects, not an unsourced marketing table.

Example queries
§ 02 · Current frontier

One representative result per capability area.

A compact snapshot of the nine capability areas: one canonical benchmark, one leading published score, and the source trail needed to inspect it. This is the human view of the same registry exposed through /api/sota.


Results: 9,102 · Models tracked: 163 · Datasets indexed: 371 · Capability areas: 9 · API schema →
Top score · canonical benchmark
Capability | Benchmark | Leading model | Metric | Score | Source | Snapshot
Language & Knowledge | MMLU-Pro | No trusted pick yet | accuracy | Pending | pending source | 2026-04-27
Vision & Documents | OCRBench v2 | Qwen2.5-VL-72B | overall | 63.70 | codesota-api | 2026-04-20
Audio & Speech | WildASR | Gemini 3 Pro | WER (lower) | 2.8 | codesota-api | 2026-04-20
Multimodal Media | VQA-v2 | Qwen2-VL 72B | accuracy | 87.6% | codesota-api | 2026-04-20
Code & Software Engineering | SWE-bench Verified | Claude Opus 4.7 | resolve rate | 87.6% | vendor | 2026-04-23
Agents & Tool Use | GAIA | No trusted pick yet | accuracy | Pending | pending source | 2026-04-27
Structured Data & Forecasting | MTEB | NV-Embed-v2 | avg | 72.31 | codesota-api | 2026-04-20
Robotics, Control & RL | Atari 2600 | go-explore | human-normalized score | 40,000 | codesota-api | 2026-04-20
Science, Medicine & Industry | MVTec-AD | SimpleNet | score | 99.60 | codesota-api | 2026-04-20
Fig 2 · Each row shows the leading value on a canonical benchmark, with higher- or lower-is-better declared in the metric label. Scores are drawn from the open JSON at /data/benchmarks.json.
API mirror
curl https://www.codesota.com/api/sota/swe-bench
Docs
§ 03 · Capability map

Nine capability areas. 121 task pages.

The public taxonomy stays small enough to understand, while each capability opens into benchmark pages, datasets, models, papers, and evidence rows.

Capability
Language & Knowledge

Reasoning, exams, retrieval, and knowledge-heavy language tasks.

MMLU-Pro · GPQA · MTEB
Capability · 63.70
Vision & Documents

Images, detection, OCR, layout, tables, and document parsing.

COCO · OCRBench · OmniDocBench
Capability · 2.8
Audio & Speech

ASR, audio tagging, voice assistants, speech quality, and TTS.

WildASR · VoiceBench · ESC-50
Capability · 87.6%
Multimodal Media

VQA, charts, video, image-text reasoning, and media understanding.

VQA-v2 · TextVQA · MMMU
Capability · 87.6%
Code & Software Engineering

Code generation, repair, repository tasks, and verified software work.

HumanEval · LiveCodeBench · SWE-bench
Capability · 87.6%
Agents & Tool Use

Long-horizon tool use, browser work, OS tasks, and workflow execution.

GAIA · WebArena · OSWorld
Capability · 72.31
Structured Data & Forecasting

Embeddings, retrieval, reranking, tabular prediction, graphs, and forecasting.

MTEB · tabular · graph suites
Capability · 40,000
Robotics, Control & RL

Simulation, control, games, embodied agents, and manipulation.

Atari · Habitat · LIBERO
Capability · 99.60
Science, Medicine & Industry

Scientific QA, medical imaging, industrial inspection, and applied AI.

CheXpert · MVTec-AD · MedQA
Fig 4 · Each tile links into the registry while preserving the top-level capability map used by agents, navigation, and contribution flows.
§ 04 · Trust layer

A leaderboard row is not a fact until it can be inspected.

CodeSOTA is useful only if the evidence is visible, so the homepage surfaces the provenance contract before editorial lineage and release notes.

01

Dated scores

Rows carry access dates and snapshot context so old frontier claims do not masquerade as current facts.

02

Metric direction

Every benchmark declares whether higher or lower is better before a winner is selected.

03

Source tiers

Paper, vendor, reproduced, and registry-maintained rows are labeled separately.

04

Provenance trail

Benchmark pages connect model, paper, dataset, code, and source URL where available.
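The four guarantees above reduce to a simple rule: a winner is only selected after the metric direction, source tier, and snapshot date are known. A minimal sketch of that check, with illustrative field names rather than the registry's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Row:
    model: str
    score: float
    direction: str   # "higher" or "lower" is better (declared per benchmark)
    source: str      # "paper" | "vendor" | "reproduced" | "registry"
    snapshot: str    # access date, e.g. "2026-04-20"

def pick_leader(rows):
    """Select a leader only after the metric direction is agreed on."""
    directions = {r.direction for r in rows}
    if directions not in ({"higher"}, {"lower"}):
        raise ValueError("rows disagree on metric direction")
    best = max if directions == {"higher"} else min
    return best(rows, key=lambda r: r.score)

# WER is lower-is-better, so 2.8 beats 3.4.
rows = [
    Row("model-a", 2.8, "lower", "reproduced", "2026-04-20"),
    Row("model-b", 3.4, "lower", "vendor", "2026-04-20"),
]
leader = pick_leader(rows)
```

Refusing to rank rows with mismatched directions is the point: an accuracy row and a WER row can never be silently compared.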

§ 05 · API

The registry is callable.

Agents and notebooks should not scrape leaderboards. They should call a stable, source-aware endpoint and cache the snapshot they used.

API docs
curl https://www.codesota.com/api/sota/swe-bench
curl https://www.codesota.com/api/sota?area=vision-documents
{
  "task": "swe-bench",
  "metric": "resolve rate",
  "direction": "higher",
  "leader": {
    "model": "registry top pick",
    "score": "dated value",
    "source": "paper | vendor | reproduced",
    "snapshot_id": "2026-04-27"
  }
}
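A client can consume that shape directly. A minimal sketch that parses the sample payload above and keys its cache on the snapshot it used, so a stale answer is never reused as a current one (the field names follow the example response, not a published schema):

```python
import json

SAMPLE = """
{
  "task": "swe-bench",
  "metric": "resolve rate",
  "direction": "higher",
  "leader": {
    "model": "registry top pick",
    "score": "dated value",
    "source": "paper | vendor | reproduced",
    "snapshot_id": "2026-04-27"
  }
}
"""

def read_sota(payload: str):
    """Parse a /api/sota response and keep the snapshot_id for caching."""
    doc = json.loads(payload)
    leader = doc["leader"]
    # Cache under (task, snapshot_id): a new snapshot means a new cache entry.
    cache_key = (doc["task"], leader["snapshot_id"])
    return cache_key, doc["direction"], leader

key, direction, leader = read_sota(SAMPLE)
```

In a live agent the payload would come from an HTTP GET against the endpoint; the parsing and cache-keying logic stays the same.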
§ 10 · Register

Trained something that beats the table?

Submit a checkpoint, paper result, or correction with structured benchmark provenance. We validate the score, cross-check the source, and add the row to the registry with its date and evidence trail.
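Before a row enters review, the submission itself has to carry the provenance contract. A sketch of that pre-check, assuming a hypothetical required-field list rather than the registry's real submission schema:

```python
REQUIRED = ("model", "benchmark", "metric", "direction", "score",
            "source_url", "source_tier", "snapshot_date")

def validate_submission(sub: dict) -> list:
    """Return a list of problems; an empty list means the row can enter review."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in sub]
    # Direction and source tier must be declared up front, not inferred later.
    if sub.get("direction") not in ("higher", "lower"):
        problems.append("direction must be 'higher' or 'lower'")
    if sub.get("source_tier") not in ("paper", "vendor", "reproduced"):
        problems.append("unknown source tier")
    return problems
```

Anything the validator flags is bounced back to the submitter; cross-checking the source and the score happens only after the structural contract is satisfied.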