00 - Question Answering

Question answering task router

QA is not one task. Span extraction, open-domain retrieval, multi-hop reasoning, and conversational answers have different failure modes. Start from the evidence source and the answer shape you need.

Benchmark: SQuAD - TriviaQA - Natural Questions
Current pick: GPT-5 / Claude 4
01 - Explainer

What this task measures.

Question answering systems map a question plus evidence into an answer, a refusal, or a cited explanation. Extractive QA measures whether the model can find a span in a passage; open-domain QA adds retrieval; multi-hop QA tests whether the system can combine evidence across documents. The modern production version is usually RAG with citation and abstention checks.
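
The three outcomes named above (an answer, a refusal, a cited explanation) can be made concrete with a small sketch. Everything here is an illustrative assumption, not part of any benchmark or library: the `QAResult` type, the `answer_or_abstain` function, and the `min_overlap` threshold are hypothetical names, and raw token overlap stands in for a real retriever and reader.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QAResult:
    answer: Optional[str]    # None means the system abstained
    citation: Optional[int]  # index of the supporting passage, if any


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; a crude stand-in for real retrieval features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def answer_or_abstain(question: str, passages: list[str], min_overlap: int = 2) -> QAResult:
    """Return the best-matching passage with its index, or abstain.

    Assumption: the passage text itself is returned as the "answer". A real
    system would run an extractive reader or a generator over that passage.
    """
    question_tokens = _tokens(question)
    best_idx, best_overlap = None, 0
    for i, passage in enumerate(passages):
        overlap = len(question_tokens & _tokens(passage))
        if overlap > best_overlap:
            best_idx, best_overlap = i, overlap
    # Abstain when no passage shares enough tokens with the question.
    if best_idx is None or best_overlap < min_overlap:
        return QAResult(answer=None, citation=None)
    return QAResult(answer=passages[best_idx], citation=best_idx)
```

The point of the sketch is the return shape: abstention is a first-class result rather than an empty string, and every answer carries the index of the evidence it came from.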

02 - Benchmarks

Use a benchmark ladder.

One leaderboard rarely captures the task. Use the canonical benchmark for lineage, then add harder or more domain-specific checks before choosing a model.

| Benchmark | Role | Metric | Caveat |
|---|---|---|---|
| SQuAD 2.0 | Extractive QA lineage | Exact Match / F1 | Saturated and passage-bound; useful for span extraction, not broad QA reliability. |
| Natural Questions | Open-domain QA | Long answer / short answer F1 | Closer to search QA, but still rewards answer overlap more than source faithfulness. |
| HotpotQA | Multi-hop reasoning | Joint EM / F1 | Tests linked evidence, but systems can exploit dataset artifacts without robust reasoning. |
| RAG eval | Production QA | Groundedness / citation support / refusal rate | Needs local documents and human review for high-liability domains. |

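
To show what the Exact Match / F1 column rewards, here is a minimal sketch of SQuAD-style scoring. It is modeled on the normalization commonly used for this family of metrics (lowercase, strip punctuation and English articles, collapse whitespace) but is not the official evaluation script; the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as agreement (e.g. an unanswerable question).
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction: str, golds: list[str]) -> float:
    """Score against each reference answer and keep the best match."""
    return max(metric(prediction, gold) for gold in golds)
```

Both metrics compare answer text only. A prediction can score well without being supported by the passage it was supposedly drawn from, which is the overlap-versus-faithfulness caveat in the table.
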
03 - Evaluation

What to compare.

The public benchmark is a shortlist signal. Production choice still depends on latency, cost, domain drift, and how expensive mistakes are.

| Axis | Value | Why it matters |
|---|---|---|
| Extractive QA | SQuAD 2.0 | Best when the answer must be a span from supplied context. |
| Knowledge QA | TriviaQA / Natural Questions | Tests retrieval and answer selection from broader evidence. |
| Modern production QA | RAG + answer verification | Most real systems need retrieval, citation, and refusal behavior. |
| Failure mode | Fluent hallucinated answer | Measure groundedness and source support, not just answer text overlap. |

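
As a rough illustration of measuring source support rather than answer overlap, the sketch below scores each answer claim against the retrieved passages. It is a deliberately crude lexical proxy under stated assumptions: the stopword list, the `0.7` threshold, and the function names are all hypothetical, and a production check would more likely use an entailment model or human review.

```python
import re

# Assumption: a tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "was", "of", "in", "on", "and", "to", "it"}


def content_tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def support_score(claim: str, passage: str) -> float:
    """Fraction of the claim's content tokens that also appear in the passage."""
    claim_tokens = content_tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & content_tokens(passage)) / len(claim_tokens)


def grounded_fraction(claims: list[str], passages: list[str], threshold: float = 0.7) -> float:
    """Share of claims that at least one passage supports above the threshold."""
    if not claims:
        return 0.0
    supported = sum(
        1
        for claim in claims
        if any(support_score(claim, passage) >= threshold for passage in passages)
    )
    return supported / len(claims)
```

A fluent answer that adds an unsupported detail lowers the grounded fraction even when its text overlaps heavily with a reference answer, which is the failure mode the last table row describes.
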
04 - Routing

Pick by task shape.

Answer from known document

Extractive reader

Cheaper and more auditable when the answer is a literal span.

Answer from corpus

Retriever + generator

Retrieval controls freshness and lets the answer cite evidence.

Multi-hop question

Reasoning LLM + citations

The model must combine facts across passages and show support.

High-liability QA

Answer verifier

Add a second pass for source coverage, contradiction, and abstention.
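
The routing cards above can be sketched as a small decision function. The names (`QARequest`, `Route`, `Pipeline`, `route`) are hypothetical, and two choices are assumptions rather than anything the cards state: multi-hop takes precedence over the known-document case, and the answer verifier is modeled as an extra pass layered on any pipeline.

```python
from dataclasses import dataclass
from enum import Enum


class Pipeline(Enum):
    EXTRACTIVE_READER = "extractive reader"
    RETRIEVER_GENERATOR = "retriever + generator"
    REASONING_LLM_CITATIONS = "reasoning LLM + citations"


@dataclass(frozen=True)
class QARequest:
    has_known_document: bool  # the answer should come from a supplied document
    needs_multi_hop: bool     # facts must be combined across passages
    high_liability: bool      # mistakes are expensive


@dataclass(frozen=True)
class Route:
    pipeline: Pipeline
    add_verifier: bool  # second pass for coverage, contradiction, abstention


def route(request: QARequest) -> Route:
    """Pick a pipeline by task shape, then decide whether to add a verifier."""
    if request.needs_multi_hop:
        # Assumption: multi-hop wins even when a single document is supplied.
        pipeline = Pipeline.REASONING_LLM_CITATIONS
    elif request.has_known_document:
        pipeline = Pipeline.EXTRACTIVE_READER
    else:
        pipeline = Pipeline.RETRIEVER_GENERATOR
    return Route(pipeline=pipeline, add_verifier=request.high_liability)
```

For example, `route(QARequest(has_known_document=False, needs_multi_hop=False, high_liability=True))` selects the retriever + generator pipeline with the verifier pass enabled.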

05 - Related

Need implementation details?

Open the lower-level explainer for architecture, code examples, and implementation options.

Open SQuAD editorial ->