Codesota · Ontology · Areas → tasks → benchmarks → results · Issue: April 27, 2026

Editorial · Ontology

How Codesota models evidence.

Codesota is more than a list of benchmarks. It is an ontology for machine-learning evidence: capability areas contain tasks, and tasks contain benchmark protocols. Each benchmark defines its datasets, metrics, splits, prompts, constraints, and aggregation rules, and every result points back to a paper, vendor report, reproduction, or correction.

This page explains the object model behind the website, the score submission flow, and the /api/sota endpoint.

§ 01 · Entities

The nouns in the registry.

Each entity exists to prevent a common benchmark error: mixing tasks with datasets, treating datasets as complete benchmark protocols, ranking numbers without metric direction, or quoting a score with no source.

Tasks

Capability area

id shape: vision-documents

A stable top-level capability group. Areas keep navigation readable without mixing modalities, methods, domains, and benchmark families.

Example · Language & Knowledge, Vision & Documents, Agents & Tool Use

Benchmarks

Task

id shape: document-ocr

The capability being evaluated. Tasks are stable enough for APIs and aliases, even when individual benchmarks age out.

Example · Document OCR, code generation, visual question answering

Protocol

Benchmark

id shape: omnidocbench

The evaluation protocol: one or more datasets, splits, metrics, prompts, constraints, aggregation rules, and versioned scoring instructions.

Example · OmniDocBench, LiveCodeBench, SWE-bench Verified, MTEB

Examples and splits

Dataset

id shape: kitab-bench-data

The raw or curated evaluation data used by a benchmark. A dataset can be reused by several benchmarks, and a benchmark can combine several datasets.

Example · ImageNet, KITAB, DocVQA, MMLU-Pro, COCO

Direction and unit

Metric

id shape: pass@1

The scoring function and comparison direction. Metrics can be generic, task-specific, or benchmark-specific.

Example · pass@1, resolve rate, CER, mAP, exact match, MTEB avg score

Result rows

Model

id shape: paddleocr-vl-1.5

A model, checkpoint, API release, agent scaffold, or system being evaluated. Models are canonicalized before rows are ranked.

Example · GPT-5, Claude Opus 4.7, PaddleOCR-VL, Qwen3

Provenance

Result

id shape: benchmark + model + metric + date

The atomic fact in Codesota: one model or system, one benchmark protocol, one metric value, one date, one source trail.

Example · Claude Opus 4.7 on SWE-bench Verified, resolve rate, dated source

Evidence

Paper / source

id shape: arxiv or vendor URL

The citation, reproduction package, vendor page, benchmark report, or correction note that justifies a result row.

Example · arXiv paper, GitHub reproduction, official leaderboard, vendor report
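
The entities above can be sketched as plain data types. This is a minimal illustration, not Codesota's actual schema: the class names, field names, the `Direction` enum, and the `key` property are assumptions, and the score, date, and URL are placeholders. Only the id shapes come from the cards above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Tuple


class Direction(Enum):
    """Comparison direction, kept on the metric so rankings cannot ignore it."""
    HIGHER_IS_BETTER = "higher"
    LOWER_IS_BETTER = "lower"


@dataclass(frozen=True)
class Metric:
    id: str                  # e.g. "pass@1"
    direction: Direction


@dataclass(frozen=True)
class Source:
    url: str                 # arXiv or vendor URL
    kind: str                # hypothetical labels: "paper", "vendor-report", ...


@dataclass(frozen=True)
class Benchmark:
    id: str                  # e.g. "omnidocbench"
    task_id: str             # e.g. "document-ocr"
    dataset_ids: Tuple[str, ...]
    metric_ids: Tuple[str, ...]


@dataclass(frozen=True)
class Result:
    benchmark_id: str
    model_id: str
    metric_id: str
    value: float
    reported_on: date
    source: Source           # a result cannot be built without a source

    @property
    def key(self):
        # Mirrors the "benchmark + model + metric + date" id shape.
        return (self.benchmark_id, self.model_id, self.metric_id, self.reported_on)


cer = Metric(id="CER", direction=Direction.LOWER_IS_BETTER)
row = Result(
    benchmark_id="omnidocbench",
    model_id="paddleocr-vl-1.5",
    metric_id=cer.id,
    value=0.0,                        # placeholder, not a real score
    reported_on=date(2026, 1, 1),     # placeholder date
    source=Source(url="https://example.org/report", kind="vendor-report"),
)
```

Making `source` a required field is one way to encode the rule that no score is quoted without provenance.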

§ 02 · Relationships

The graph is small on purpose.

The core hierarchy is strict. Evidence objects attach to it. That keeps pages, APIs, and contribution review aligned.

Capability area

has many →

Tasks

Task

has many →

Benchmarks

Benchmark

uses →

Datasets

Benchmark

defines →

Metrics and protocol

Benchmark

has many →

Results

Model

has many →

Results

Paper / source

supports →

Models and results

Result

selects →

SOTA pick per benchmark

Lineage

orders →

Benchmarks over time

Submission

proposes →

New or corrected results
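
A strict hierarchy like the one above can be walked with a few parent-to-children maps. The sketch below is illustrative only: the three dictionaries, the function name, and the placement of `document-ocr` under `vision-documents` are assumptions, and the source URL is a placeholder.

```python
from collections import defaultdict

# Edges of the core hierarchy, stored as parent -> children.
area_tasks = defaultdict(list)
task_benchmarks = defaultdict(list)
benchmark_results = defaultdict(list)

# Hypothetical wiring that reuses the id shapes from the entity cards.
area_tasks["vision-documents"].append("document-ocr")
task_benchmarks["document-ocr"].append("omnidocbench")
benchmark_results["omnidocbench"].append(
    {"model_id": "paddleocr-vl-1.5", "source_url": "https://example.org/report"}
)


def results_for_area(area_id):
    """Walk area -> tasks -> benchmarks -> results, yielding one tuple per row."""
    # .get() avoids creating empty entries in the defaultdicts for unknown ids.
    for task_id in area_tasks.get(area_id, []):
        for benchmark_id in task_benchmarks.get(task_id, []):
            for row in benchmark_results.get(benchmark_id, []):
                yield task_id, benchmark_id, row


paths = list(results_for_area("vision-documents"))
```

Because every result is reached through exactly one benchmark and one task, a page, an API response, and a review queue can all use the same walk.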

§ 03 · Lifecycle

From claim to registry row.

The ontology is what lets Codesota accept messy real-world evidence while publishing stable, inspectable objects.

Extract

A paper, leaderboard, vendor post, or community submission names models, benchmarks, datasets, metrics, protocols, and scores.

Canonicalize

Messy names are mapped to existing IDs: benchmark aliases resolve to protocols, dataset aliases resolve to data objects, models are de-duplicated, and task ownership is checked.

Validate

Metric direction, result date, source type, hidden status, protocol version, and benchmark comparability are reviewed before a row can influence a leaderboard.

Rank

The best valid row becomes the current SOTA for a benchmark or task. /api/sota exposes that pick with runners-up and provenance.

Revise

Corrections, contamination flags, lineage changes, and new benchmark successors are appended rather than silently rewriting history.
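
The canonicalize, validate, rank, and revise steps can be sketched in a few functions. Everything here is an assumption made for illustration: the alias table, the row layout, the function names, and the sample rows, whose models and values are made up. The only facts relied on are that CER is lower-is-better and resolve rate is higher-is-better.

```python
from datetime import date

# Hypothetical alias table: lowercase alias -> canonical task ID.
ALIASES = {"ocr": "document-ocr", "document ocr": "document-ocr", "document-ocr": "document-ocr"}

# Comparison direction per metric: True means higher is better.
HIGHER_IS_BETTER = {"resolve rate": True, "CER": False}


def canonicalize(name, aliases=ALIASES):
    """Map a messy name to a canonical ID, or None if it needs human review."""
    return aliases.get(name.strip().lower())


def validate(row):
    """Return True only if the row may influence a leaderboard."""
    return (
        row.get("metric") in HIGHER_IS_BETTER
        and isinstance(row.get("value"), (int, float))
        and isinstance(row.get("date"), date)
        and bool(row.get("source_url"))
        and not row.get("hidden", False)
    )


def rank(rows, metric):
    """Order valid rows best-first, honoring the metric's direction."""
    valid = [r for r in rows if validate(r) and r["metric"] == metric]
    return sorted(valid, key=lambda r: r["value"], reverse=HIGHER_IS_BETTER[metric])


def revise(history, note):
    """Append a correction; earlier entries are never rewritten."""
    return history + [note]


# Made-up rows: the third has the lowest CER but no source, so it is rejected.
rows = [
    {"model": "model-a", "metric": "CER", "value": 0.10,
     "date": date(2026, 1, 1), "source_url": "https://example.org/a"},
    {"model": "model-b", "metric": "CER", "value": 0.05,
     "date": date(2026, 1, 2), "source_url": "https://example.org/b"},
    {"model": "model-c", "metric": "CER", "value": 0.01,
     "date": date(2026, 1, 3), "source_url": ""},
]
ranked = rank(rows, "CER")
history = revise([], "contamination flag on model-a")
```

Filtering before sorting means an unsourced or hidden row can never become the pick, however good its number looks.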

§ 04 · Live snapshot

The hierarchy in practice.

These are the highest-coverage public areas right now, with a few tasks and benchmarks shown under each. Until the schema migration lands, these benchmark protocols are read from legacy dataset rows. The full browser remains at /browse.

Natural Language Processing

16 tasks · 27 benchmark rows · 7436 results

Polish LLM General

Open PL LLM Leaderboard · 5100

Polish Cultural Competency

PLCC · 1155

Polish Text Understanding

CPTU-Bench · 465

Polish Conversation Quality

Polish MT-Bench · 450

Computer Vision

23 tasks · 188 benchmark rows · 2120 results

Optical Character Recognition

cnn-/-daily-mail · 101; scut-ctw1500 · 82; e2e · 45; icdar2013 · 39; inverse-text · 34

Scene Text Detection

ICDAR 2015 · 188; Total-Text · 126; msra-td500 · 79; icdar-2013 · 59; icdar-2017-mlt · 54

Document Parsing

olmOCR-Bench · 74; OmniDocBench · 61; ParseBench · 14

Document Layout Analysis

publaynet-val · 92; document-layout-recognition-challenge-test · 18; document-layout-recognition-challenge-mini-dev · 12; u-diads-bib · 8; d4la · 3

Speech

3 tasks · 13 benchmark rows · 532 results

Speech Recognition

LibriSpeech · 111; Open ASR Leaderboard · 102; SPGISpeech · 56; VoxPopuli · 55; AMI-IHM · 50

Speaker Verification

VoxCeleb1-O · 3

Speech Translation

MuST-C En-De tst-COMMON · 3

Reasoning

5 tasks · 20 benchmark rows · 486 results

Commonsense Reasoning

MMLU-Pro · 71; MMLU · 64; HellaSwag · 17; WinoGrande · 13; ARC-Challenge · 10

Multi-step Reasoning

GPQA Diamond · 74; HLE · 74; BIG-Bench Hard · 11; StrategyQA · 2

Mathematical Reasoning

GSM8K · 48; MATH · 46; AIME 2025 · 22; AIME 2024 · 11

Logical Reasoning

ARC-AGI-1 · 5; ARC-AGI-2 · 3; LogiQA · 2; ReClor · 2

Computer Code

6 tasks · 15 benchmark rows · 297 results

Code Generation

SWE-bench Verified · 61; SWE-bench · 57; LiveCodeBench · 54; HumanEval · 45; MBPP · 22

Code Translation

TransCoder (GeeksForGeeks) · 7

Multimodal

10 tasks · 26 benchmark rows · 267 results

Visual Question Answering

MMMU-Pro · 31; MMMU · 30; RealWorldQA · 23; TextVQA · 23; MMBench · 20

Image-Text-to-Text

MMMU · 36; MMStar · 21; MMBench

Video Understanding

Video-MME · 24; MVBench · 20

Text-to-Image Generation

GenEval · 8; DPG-Bench; MJHQ-30K FID

Agentic AI

10 tasks · 21 benchmark rows · 225 results

SWE-bench

SWE-bench Verified · 81

Task agents

AcademiClaw · 35; Collider-Bench · 6; MedMemoryBench · 2; PhysicianBench · 2; BFCL

Web & Desktop Agents

OSWorld · 28; WebArena · 11

Autonomous Coding

Terminal-Bench 2.0 · 20; SWE-bench Verified · 3

Computer Vision

15 tasks · 202 benchmark rows · 208 results

Object Detection

COCO · 79; LVIS v1.0 · 16; Pascal VOC 2012 · 9; COCO 2014 val; COCO test-dev

Image Classification

ImageNet-1K · 47; ImageNet · 15; CIFAR-100 · 10; CIFAR-10 · 9; ImageNet-V2 · 6

Video classification

Kinetics-400 · 5; Something-Something V2 · 5; UCF-101 · 3; COIN; Diving-48

Image segmentation

BSDS500 · 1; COCO 2017 Instance Segmentation · 1; LVIS (Instance Segmentation) · 1; ADE20K; BRAVO (OOD)

§ 05 · API contract

Why ontology matters to callers.

The API can be simple because the ontology is explicit. A task alias resolves to a canonical task; the task points to benchmark protocols; each benchmark defines datasets and metrics; result rows decide the pick.

Example response shape

GET /api/sota/ocr

task           -> document-ocr
benchmark      -> omnidocbench
datasets       -> [omnidocbench-pages, docvqa-tables, formula-splits]
metric         -> edit / layout / table / formula aggregate
pick.model_id  -> paddleocr-vl-1.5
pick.score     -> numeric metric value
source_url     -> evidence trail
snapshot_id    -> cache key for this registry state
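
A caller might consume that shape roughly as follows. This is a sketch under stated assumptions: the page lists field names but not a wire format, so the JSON layout (including `pick` as a nested object), the `parse_sota` and `summarize` helpers, and every sample value except the IDs shown above are hypothetical. The score, URL, and snapshot ID are placeholders.

```python
import json

# Hypothetical payload built from the field names in the example above.
SAMPLE = """
{
  "task": "document-ocr",
  "benchmark": "omnidocbench",
  "datasets": ["omnidocbench-pages", "docvqa-tables", "formula-splits"],
  "metric": "edit / layout / table / formula aggregate",
  "pick": {"model_id": "paddleocr-vl-1.5", "score": 0.0},
  "source_url": "https://example.org/evidence",
  "snapshot_id": "snapshot-placeholder"
}
"""

REQUIRED = ("task", "benchmark", "datasets", "metric", "pick", "source_url", "snapshot_id")


def parse_sota(payload):
    """Decode a response and refuse picks that lack any required field."""
    data = json.loads(payload)
    missing = [name for name in REQUIRED if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not data["source_url"]:
        raise ValueError("pick has no evidence trail")
    return data


def summarize(data):
    """One-line summary keyed by snapshot, so cached copies stay distinguishable."""
    pick = data["pick"]
    return f'{data["task"]} / {data["benchmark"]}: {pick["model_id"]} @ {data["snapshot_id"]}'


summary = summarize(parse_sota(SAMPLE))
```

Carrying `snapshot_id` into the summary lets a caller tell whether two answers came from the same registry state.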
