Benchmark lineage

Visual Question Answering

9 benchmarks · 8 edges · Updated 2026-04-23

From the original image+question task to broad multimodal reasoning. The attention path tracks where leaderboard focus has moved; branches show specialized variants that remain active.

Editor's note

MMMU is placed on the attention path as a scope shift: it's not strictly the same task as VQAv2, but the field's attention migrated there once VQAv2 saturated. Specialized VQA variants (knowledge, text, compositional) are shown as branches and remain active in their own right.

§ 01 · Lineage graph

Attention path plus branches.

Solid arrows follow the attention path; the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches.

[Lineage graph. Legend: attention path · scope shift · branch/fork; node status: active · saturating · saturated/superseded. Nodes: VQA (May 2015) → VQAv2 (Apr 2017, SOTA 87.6%) ⇢ MMMU (Nov 2023, SOTA 86.0%) → MMMU-Pro (Sep 2024, SOTA 82.0%); forks from VQAv2: TextVQA (Apr 2019, SOTA 85.5%), GQA (May 2019), OK-VQA (Jun 2019) → A-OKVQA (Jun 2022), ScienceQA (Sep 2022).]
VQA → VQAv2 · direct successor · attention
Models answered correctly without looking at the image; VQAv2's balanced pairs force visual grounding.

VQAv2 → MMMU · scope shift · attention
VQAv2 saturated above 85%. Leaderboard attention moved to broader multimodal reasoning.

MMMU → MMMU-Pro · direct successor · attention
Text-only shortcuts and a narrow answer space in MMMU. Pro adds vision-only questions and ten answer choices.

VQAv2 → GQA · fork
Compositional structure over scene graphs.

VQAv2 → OK-VQA · fork
External-knowledge requirement.

OK-VQA → A-OKVQA · direct successor
Broader knowledge types and better annotation.

VQAv2 → TextVQA · fork
Reading text in the image: an OCR-grounded sub-task.

VQAv2 → ScienceQA · fork
Multimodal chain-of-thought reasoning.
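
For readers who want the graph as data, here is a minimal Python sketch of the edge list above. The node names and edge types come straight from the graph; the schema itself (tuples with an attention-track flag) is an illustrative assumption, not Codesota's actual format.

# Hypothetical encoding of the lineage: (source, target, edge_type, track).
# Node names and edge types are from the graph; the schema is illustrative.
EDGES = [
    ("VQA",     "VQAv2",     "direct_successor", "attention"),
    ("VQAv2",   "MMMU",      "scope_shift",      "attention"),
    ("MMMU",    "MMMU-Pro",  "direct_successor", "attention"),
    ("VQAv2",   "GQA",       "fork",             None),
    ("VQAv2",   "OK-VQA",    "fork",             None),
    ("OK-VQA",  "A-OKVQA",   "direct_successor", None),
    ("VQAv2",   "TextVQA",   "fork",             None),
    ("VQAv2",   "ScienceQA", "fork",             None),
]

# The attention path is the chain of edges on the "attention" track.
attention_path = [(s, t) for s, t, _, track in EDGES if track == "attention"]
print(attention_path)
# [('VQA', 'VQAv2'), ('VQAv2', 'MMMU'), ('MMMU', 'MMMU-Pro')]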
§ 02 · Benchmarks in this lineage

Nodes in detail.

May 2015 · Superseded

VQA

Visual Question Answering

The original image+question → answer task on COCO images. Established the task.

Antol et al. · paper
Apr 2017 · Saturated
View benchmark page →

VQAv2

Balanced pairs kill language priors: each question has two similar images with different answers, so models must actually look.

Goyal et al. · paper
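
To make the balancing concrete, a minimal sketch in Python. The record layout (field names, filenames) is invented, not VQAv2's release format; the point is only that an image-blind model can win at most half of a complementary pair.

# Hypothetical record for one VQAv2 complementary pair (field names invented).
pair = {
    "question": "Is the umbrella open?",
    "complements": [
        {"image": "img_001.jpg", "answer": "yes"},
        {"image": "img_002.jpg", "answer": "no"},
    ],
}

def blind_model(question: str) -> str:
    # Answers from language priors alone; never looks at the image.
    return "yes"

# Because the two answers differ, a blind model gets at most 1 of 2.
hits = sum(blind_model(pair["question"]) == c["answer"]
           for c in pair["complements"])
assert hits <= 1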

Apr 2019

TextVQA

VQA requiring reading text embedded in images — the OCR↔VQA bridge.

Singh et al. · paper
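
A sketch of the bridge, assuming a two-stage pipeline (OCR, then answer selection). Both answer_from_ocr and the toy scorer are hypothetical; the TextVQA paper's actual baseline learns to copy OCR tokens rather than matching strings.

# Hypothetical OCR→answer selection; not the TextVQA baseline model.
def answer_from_ocr(question: str, ocr_tokens: list[str], score) -> str:
    # score(question, token) is an assumed relevance function.
    return max(ocr_tokens, key=lambda tok: score(question, tok))

# Toy usage: prefer OCR tokens that literally appear in the question.
tokens = ["STOP", "MAIN", "ST"]
print(answer_from_ocr("Does the sign say stop?", tokens,
                      lambda q, t: int(t.lower() in q.lower())))  # STOP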
May 2019 · Saturated
View benchmark page →

GQA

Compositional VQA grounded in scene graphs.

Hudson & Manning · paper
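
GQA pairs each question with a functional program executed over the scene graph. The sketch below is an illustrative toy, not GQA's actual schema or operator set.

# Toy scene graph and compositional program (illustrative, not GQA's schema).
scene = {
    "objects": {
        "o1": {"name": "chair", "attributes": ["red"]},
        "o2": {"name": "table", "attributes": ["wooden"]},
    },
    "relations": [("o1", "left of", "o2")],
}

def run_program(scene, program):
    selected = set(scene["objects"])
    for op, arg in program:
        if op == "select":      # keep objects with this name
            selected = {o for o in selected
                        if scene["objects"][o]["name"] == arg}
        elif op == "relate":    # objects standing in relation `arg` to the selection
            selected = {s for (s, rel, t) in scene["relations"]
                        if rel == arg and t in selected}
        elif op == "query":     # read a property off the single remaining object
            (obj,) = selected
            return scene["objects"][obj][arg]

# "What color is the thing to the left of the table?"
print(run_program(scene, [("select", "table"),
                          ("relate", "left of"),
                          ("query", "attributes")]))  # ['red']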

Jun 2019

OK-VQA

VQA requiring external world knowledge the image alone doesn't provide.

Marino et al. · paper
Jun 2022 · Active

A-OKVQA

OK-VQA successor with more diverse knowledge types and commonsense reasoning.

Schwenk et al. · paper
Sep 2022 · Active

ScienceQA

Multimodal chain-of-thought reasoning on science questions.

Lu et al. · paper
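
A sketch of what one multimodal chain-of-thought item might look like. The field names and filename are invented, though ScienceQA does ship rationale-style annotations (lectures and explanations) alongside each question.

# Hypothetical item layout; ScienceQA's real schema differs in detail.
example = {
    "image": "food_web.png",   # invented filename
    "question": "Which organism is a producer?",
    "choices": ["hawk", "grass", "rabbit"],
    "rationale": "Producers make their own food; grass photosynthesizes.",
    "answer": "grass",
}

# Chain-of-thought evaluation asks the model to produce the rationale
# before the final choice, rather than the choice alone.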
Nov 2023 · Saturating
View benchmark page →

MMMU

Massive Multi-discipline Multimodal Understanding

College-level multimodal questions across 30 subjects. Broader in scope than task-specific VQA; leaderboard attention moved here once VQAv2 saturated.

Yue et al. · paper

Sep 2024

MMMU-Pro

Harder MMMU with vision-only questions and ten answer choices; closes the text-only shortcuts models exploited in MMMU.

Yue et al. · paper
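
One way to see why ten choices matter: the floor available to a text-only guesser drops. A quick back-of-envelope, assuming four options per item in MMMU (the common case for its multiple-choice questions) versus ten in MMMU-Pro.

# Expected accuracy of uniform random guessing on k-way multiple choice.
def chance_accuracy(num_choices: int) -> float:
    return 1.0 / num_choices

print(f"4-way guess floor (typical MMMU item): {chance_accuracy(4):.0%}")   # 25%
print(f"10-way guess floor (MMMU-Pro):         {chance_accuracy(10):.0%}")  # 10%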