Natural Language Processing

Reading Comprehension

Understanding and answering questions about passages.


Reading comprehension tests a model's ability to answer questions about a given passage — the quintessential NLU evaluation. SQuAD launched the modern era, but benchmarks now span multi-hop reasoning (HotpotQA), conversational QA (CoQA), and adversarial probing (AdversarialQA). LLMs have saturated most benchmarks, shifting focus to harder multi-document and reasoning-intensive tasks.

History

2016

SQuAD 1.1 (Rajpurkar et al.) provides 100K+ extractive QA pairs and becomes one of the most cited NLP datasets

2017

BiDAF (Seo et al.) introduces bidirectional attention flow for passage comprehension

2018

BERT reaches 93.2 F1 on SQuAD 1.1, surpassing the reported human benchmark (91.2 F1) and showing that extractive RC is 'solved' for simple cases

2018

SQuAD 2.0 adds unanswerable questions — models must learn to abstain when no answer exists in the passage

2018

HotpotQA and MultiRC require multi-hop reasoning across multiple paragraphs

2018

CoQA and QuAC introduce conversational reading comprehension with follow-up questions

2020

UnifiedQA (Khashabi et al.) trains a single T5 model across 20+ RC datasets, showing that diverse QA formats can be unified in one model

2023

GPT-4 achieves near-perfect scores on SQuAD, CoQA, and NarrativeQA; focus shifts to harder benchmarks

2024

DROP (discrete reasoning over paragraphs) and IIRC (incomplete information) remain challenging for LLMs

How Reading Comprehension Works

1

Input encoding

Question and passage are concatenated and encoded by the transformer; cross-attention allows the model to focus on relevant spans
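The encoding format can be sketched with a toy whitespace tokenizer (real systems use a subword tokenizer such as WordPiece, but the concatenation and segment-id layout are the same):

```python
def encode_pair(question: str, passage: str):
    """Concatenate question and passage the way BERT-style RC models do:
    [CLS] question tokens [SEP] passage tokens [SEP], with segment ids
    (token_type_ids) marking which side each token belongs to."""
    q_tokens = question.lower().split()   # toy whitespace tokenizer
    p_tokens = passage.lower().split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + p_tokens + ["[SEP]"]
    # 0 for the question segment (incl. [CLS] and first [SEP]), 1 for the passage
    token_type_ids = [0] * (len(q_tokens) + 2) + [1] * (len(p_tokens) + 1)
    return tokens, token_type_ids

tokens, segs = encode_pair("Who wrote Hamlet?", "Hamlet was written by Shakespeare.")
```

The encoder then attends across the full concatenated sequence, which is what lets question tokens condition on passage tokens and vice versa.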

2

Span prediction (extractive)

Two linear heads predict start and end positions of the answer span within the passage
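Decoding from the two heads can be sketched with toy logits (real logits come from the linear heads applied to each token's encoder representation):

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick the (start, end) pair maximizing start_logits[s] + end_logits[e],
    subject to s <= e and a maximum answer length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best, best_score

# Toy passage tokens and logits: the model is most confident the answer
# spans tokens 3..4.
passage = ["Hamlet", "was", "written", "by", "Shakespeare"]
start_logits = [0.1, 0.0, 0.2, 5.0, 1.0]
end_logits   = [0.0, 0.1, 0.3, 0.5, 6.0]
span, _ = best_span(start_logits, end_logits)
print(" ".join(passage[span[0]:span[1] + 1]))  # → by Shakespeare
```

The s <= e and length constraints are what rule out degenerate spans that pure argmax over each head independently could produce.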

3

Answerability check

For SQuAD 2.0-style tasks, a separate head predicts whether the question is answerable from the given passage
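A minimal sketch of the common null-score decoding scheme, assuming the [CLS] position (index 0) serves as the "no answer" span and the threshold is tuned on dev data:

```python
def predict_with_abstention(start_logits, end_logits, null_threshold=0.0):
    """SQuAD 2.0-style decoding: compare the best non-null span score against
    the null score (start + end logits at the [CLS] position, index 0).
    Abstain when the gap does not clear the tuned threshold."""
    null_score = start_logits[0] + end_logits[0]
    best, best_score = None, float("-inf")
    for s in range(1, len(start_logits)):          # skip [CLS] at index 0
        for e in range(s, len(end_logits)):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    if best_score - null_score > null_threshold:
        return best      # answerable: return the span
    return None          # abstain: no answer in the passage

# Model puts most mass on the [CLS] / null position → abstain.
print(predict_with_abstention([8.0, 0.1, 0.2], [7.0, 0.3, 0.1]))  # → None
```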

4

Free-form generation (abstractive)

For generative RC, the model produces the answer token by token, grounded in the passage context
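The decoding loop above can be sketched as greedy generation; the mock `next_token_logits` callable is a stand-in for a real decoder conditioned on the passage and the tokens generated so far:

```python
def greedy_decode(next_token_logits, max_new_tokens=10, eos="</s>"):
    """Generate an answer token by token: at each step, pick the highest-scoring
    next token given the tokens generated so far, stopping at EOS."""
    generated = []
    for _ in range(max_new_tokens):
        logits = next_token_logits(generated)   # mapping: token -> score
        token = max(logits, key=logits.get)
        if token == eos:
            break
        generated.append(token)
    return generated

# Mock decoder: scripted to produce "by Shakespeare" then EOS.
script = [
    {"by": 2.0, "the": 1.0},
    {"Shakespeare": 3.0, "him": 0.5},
    {"</s>": 5.0, "and": 0.1},
]
answer = greedy_decode(lambda gen: script[len(gen)])
print(" ".join(answer))  # → by Shakespeare
```

Real systems sample or beam-search instead of pure greedy decoding, but the step structure is the same.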

Current Landscape

Reading comprehension in 2025 is a mature evaluation paradigm where standard benchmarks (SQuAD, CoQA, NarrativeQA) are effectively saturated by frontier LLMs. The task remains valuable as a component of more complex systems — RAG pipelines are essentially reading comprehension at scale. Research has shifted to harder variants: multi-hop reasoning (HotpotQA, MuSiQue), discrete reasoning (DROP), and adversarial robustness. The extractive RC paradigm (selecting spans) is being replaced by generative RC (free-form answers with citations).

Key Challenges

Multi-hop reasoning: questions requiring information from 2+ disconnected paragraphs remain much harder than single-hop

Shortcut exploitation: models often answer from passage-question lexical overlap rather than genuine comprehension

Free-form answer evaluation: comparing generated answers to references is error-prone (correct but differently worded answers score poorly)
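The standard SQuAD token-overlap F1 illustrates the problem: a correct paraphrase that shares few tokens with the reference scores near zero. A simplified version (the official script also strips articles and punctuation before matching):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1: harmonic mean of precision and recall over
    bag-of-words overlap, with naive lowercase whitespace tokenization."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the 44th president", "the 44th president"))  # → 1.0
print(token_f1("Barack Obama", "the 44th president"))        # → 0.0 despite being correct
```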

Long-document comprehension: passages exceeding context windows require chunking strategies that may miss relevant spans
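Chunking is typically done with a sliding window and an overlap (a "doc stride"), so that a span cut off at one window's boundary still appears whole in the next window; a minimal sketch:

```python
def sliding_chunks(tokens, max_len=384, stride=128):
    """Split a long token sequence into overlapping windows. Each window starts
    max_len - stride tokens after the previous one, so any answer span no longer
    than the stride is fully contained in at least one window."""
    step = max_len - stride
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

chunks = sliding_chunks(list(range(1000)), max_len=384, stride=128)
# Consecutive windows overlap by exactly `stride` tokens; at inference time the
# model is run on every window and the highest-scoring span across windows wins.
```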

Conversational context: maintaining coreference and dialogue state across multi-turn QA is unsolved

Quick Recommendations

Best overall RC

GPT-4o or Claude 3.5 Sonnet

Near-perfect on SQuAD, CoQA; strong on multi-hop and reasoning-intensive benchmarks

Extractive RC (production)

DeBERTa-v3-large fine-tuned on SQuAD 2.0

93+ F1; no hallucination risk since answers are spans from the passage

Multi-dataset RC

UnifiedQA-v2 (T5-based)

Single model handles extractive, abstractive, multiple-choice, and yes/no QA formats

Conversational RC

GPT-4o with dialogue history in context

Handles coreference resolution and follow-up questions naturally

What's Next

The future of reading comprehension is its integration into agentic and multi-document reasoning systems. Standalone passage-level RC will give way to corpus-level QA where models must find, read, and synthesize across thousands of documents. Expect evaluation to shift from F1 on extracted spans to faithfulness and attribution quality in generated answers.
