How benchmarks evolve.
Each lineage traces how community attention moved from one saturated benchmark to the next, with the reason recorded on every transition. The attention path records where the leaderboard went; branches show specialized variants that remain active.
Current SOTA numbers are pulled from the live registry; we don't duplicate scores into editorial copy.
Agentic AI Benchmarks
How evaluation of AI agents evolved from structured task completion in synthetic environments through real-world software engineering to open-ended computer use. The coding lineage (see coding.json…
Audio Understanding Benchmarks
How audio AI evaluation evolved from environmental sound classification on small datasets through large-scale event detection to foundation-model-era benchmarks that combine audio perception with l…
Coding Benchmarks
How code-generation evaluation moved from short Python functions to repository-scale software engineering. The attention path tracks the benchmarks to which frontier focus has migrated; branches show speciali…
Mathematical Reasoning Benchmarks
How mathematical reasoning evaluation evolved from grade-school word problems through competition mathematics to research-frontier problems that current AI cannot reliably solve. The lineage traces…
Multimodal Reasoning Benchmarks
How vision-language model evaluation moved beyond visual question answering (covered in the VQA lineage) into multimodal reasoning — science, mathematics, chart understanding, and expert-level perc…
NLP Benchmarks
How natural language understanding evaluation evolved from narrow task-specific tests to multi-task suites, and then was eclipsed by 'reasoning' as the frontier label. GLUE unified disparate NLU ta…
OCR Benchmarks
How optical character recognition evaluation moved from word-level handwriting transcription to whole-document parsing with tables, charts, and layout. The attention path tracks the frontier focus; bran…
Reasoning Benchmarks
How evaluation of language-model reasoning evolved from broad knowledge testing to expert-level problems that frontier models still cannot reliably solve. The lineage runs from MMLU's wide-…
Speech Recognition Benchmarks
How automatic speech recognition evaluation evolved from clean read speech on LibriSpeech, through multi-speaker and noisy conditions, toward naturalistic and multilingual benchmarks that reflect r…
Vision Benchmarks
How computer vision evaluation moved from image classification on ImageNet through object detection and dense prediction on COCO, to open-world promptable segmentation with SA-1B and SA-V. The line…
Visual Question Answering
From the original image+question task to broad multimodal reasoning. The attention path tracks where leaderboard focus has moved; branches show specialized variants that remain active.