Codesota · Lineages · 11 lineages curated
Editorial · Lineages

How benchmarks evolve.

Each lineage traces where community attention moved from one saturated benchmark to the next — with the reason recorded on every transition. The attention path records where the leaderboard went; branches show specialized variants that remain active.

Current SOTA numbers pull from the live registry — we don't duplicate scores into editorial copy.

Published
Lineage · agentic · 7 benchmarks

Agentic AI Benchmarks

How evaluation of AI agents evolved from structured task completion in synthetic environments through real-world software engineering to open-ended computer use. The coding lineage (see coding.json…

6 active · 0 saturated · 3 branches · updated 2026-04-27
Lineage · audio · 7 benchmarks

Audio Understanding Benchmarks

How audio AI evaluation evolved from environmental sound classification on small datasets through large-scale event detection to foundation-model-era benchmarks that combine audio perception with l…

5 active · 1 saturated · 2 branches · updated 2026-04-27
Lineage · coding · 13 benchmarks

Coding Benchmarks

How code-generation evaluation moved from short Python functions to repository-scale software engineering. The attention path tracks the benchmark frontier that focus has migrated to; branches show speciali…

8 active · 4 saturated · 6 branches · updated 2026-04-26
Lineage · math · 6 benchmarks

Mathematical Reasoning Benchmarks

How mathematical reasoning evaluation evolved from grade-school word problems through competition mathematics to research-frontier problems that current AI cannot reliably solve. The lineage traces…

4 active · 1 saturated · 1 branch · updated 2026-04-27
Lineage · multimodal · 6 benchmarks

Multimodal Reasoning Benchmarks

How vision-language model evaluation moved beyond visual question answering (covered in the VQA lineage) into multimodal reasoning — science, mathematics, chart understanding, and expert-level perc…

5 active · 0 saturated · 2 branches · updated 2026-04-27
Lineage · nlp · 7 benchmarks

NLP Benchmarks

How natural language understanding evaluation evolved from narrow task-specific tests to multi-task suites, and then was eclipsed by 'reasoning' as the frontier label. GLUE unified disparate NLU ta…

1 active · 5 saturated · 3 branches · updated 2026-04-27
Lineage · ocr · 12 benchmarks

OCR Benchmarks

How optical character recognition evaluation moved from word-level handwriting transcription to whole-document parsing with tables, charts and layout. The attention path tracks the frontier focus; bran…

9 active · 3 saturated · 5 branches · updated 2026-04-27
Lineage · reasoning · 6 benchmarks

Reasoning Benchmarks

How evaluations of language-model reasoning evolved from broad knowledge testing to expert-level problem solving that frontier models still cannot reliably solve. The lineage runs from MMLU's wide-…

5 active · 1 saturated · 3 branches · updated 2026-04-27
Lineage · speech · 7 benchmarks

Speech Recognition Benchmarks

How automatic speech recognition evaluation evolved from clean read speech on LibriSpeech, through multi-speaker and noisy conditions, toward naturalistic and multilingual benchmarks that reflect r…

6 active · 1 saturated · 3 branches · updated 2026-04-27
Lineage · vision · 7 benchmarks

Vision Benchmarks

How computer vision evaluation moved from image classification on ImageNet through object detection and dense prediction on COCO, to open-world promptable segmentation with SA-1B and SA-V. The line…

3 active · 3 saturated · 2 branches · updated 2026-04-27
Lineage · vqa · 9 benchmarks

Visual Question Answering

How visual question answering evaluation moved from the original image+question task to broad multimodal reasoning. The attention path tracks where leaderboard focus has moved; branches show specialized variants that remain active.

5 active · 3 saturated · 5 branches · updated 2026-04-23