Question Answering
Answer questions based on context or knowledge. Foundation for chatbots, search, and knowledge systems.
How Question Answering Works
A technical deep-dive into Question Answering systems. From extractive span prediction to generative reasoning, and from fine-tuned models to RAG pipelines.
The Core Insight
Why question answering is fundamentally different from search.
Information is buried in text. Users have questions, but documents have paragraphs. Traditional search returns documents, not answers. You ask 'When was Einstein born?' and get 10 articles about Einstein instead of '1879.'
The solution: train models to read passages and either extract exact answer spans (extractive QA) or generate natural-language answers (generative QA). The model becomes a reading comprehension machine.
QA models don't just match keywords - they understand the question's intent and locate or synthesize the precise information needed to answer it.
Search Returns Documents. QA Returns Answers.
Extractive vs Generative QA
The fundamental choice: copy from the text or generate an answer.
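To make the extractive side concrete, here is a minimal sketch of span prediction with Hugging Face `transformers`, using the same `deepset/roberta-base-squad2` checkpoint as the code example later in this page: the model scores every token as a possible answer start and end, and the best-scoring span is decoded back into text.

```python
# Minimal sketch of extractive span prediction: the model emits a start logit
# and an end logit for every token; the argmax pair defines the answer span.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "When was Einstein born?"
context = "Albert Einstein was born on March 14, 1879, in Ulm, Germany."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Most likely start and end token positions (a production system would also
# enforce end >= start and cap the span length)
start = int(torch.argmax(outputs.start_logits))
end = int(torch.argmax(outputs.end_logits))

answer_ids = inputs["input_ids"][0][start : end + 1]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))
```

A generative model instead conditions on the question plus passage and decodes free-form text, which is what the Flan-T5 and GPT-4 entries below do.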
The Context Window Problem
How much text can your model read at once?
Every QA model has a limit on how much text it can read at once. BERT reads ~512 tokens. GPT-4 reads ~128K tokens. When your documents exceed this limit, you must choose: truncate and miss information, or retrieve relevant passages first (RAG).
Context Window Sizes
Short context (e.g., BERT, 512 tokens):
- Must chunk documents and retrieve relevant pieces
- Risk missing information if retrieval fails
- Cannot answer questions spanning multiple sections
Long context (e.g., GPT-4, 128K tokens):
- Can read entire documents without chunking
- No retrieval errors possible
- Simpler pipeline (no vector DB needed)
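When a corpus has to fit a short window, the usual first step is sliding-window chunking before retrieval. A minimal sketch, assuming a BERT-style tokenizer; the `long_document` placeholder, chunk size, and overlap are illustrative:

```python
# Sketch: split a long document into overlapping token windows that fit a
# 512-token model, leaving headroom for the question and special tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_document(text: str, max_tokens: int = 384, overlap: int = 128) -> list[str]:
    """Greedy sliding window over token IDs; the overlap reduces the chance
    that an answer gets cut in half at a chunk boundary."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start : start + max_tokens]
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(token_ids):
            break
    return chunks

long_document = "..."  # thousands of tokens of source text (placeholder)
chunks = chunk_document(long_document)  # each chunk is then indexed for retrieval
```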
RAG vs Fine-tuned vs Hybrid
Three fundamentally different ways to add domain knowledge to your QA system.
- Fine-tuned: train the model on your specific domain data.
- RAG: retrieve relevant documents, then generate the answer from that context.
- Hybrid: a fine-tuned model with RAG augmentation.
Models for Question Answering
From BERT to GPT-4: how QA models have evolved.
| Model | Type | Params | Context | Best For |
|---|---|---|---|---|
| BERT-base-uncased (Google) | Extractive | 110M | 512 tokens | Production extractive QA with low latency requirements |
| RoBERTa-large (Meta) | Extractive | 355M | 512 tokens | When you need better accuracy than BERT |
| Flan-T5-XL (Google) | Generative | 3B | 512 tokens | Generative QA with instruction following |
| GPT-4 (OpenAI) | Generative | Unknown | 128K tokens | Complex QA requiring reasoning or long documents |
| Llama 3 70B (Meta) | Generative | 70B | 8K tokens | Self-hosted generative QA without API dependencies |
Benchmarks
Standard datasets for evaluating question answering systems.
| Benchmark | Description | Type | Size | Metric | SOTA |
|---|---|---|---|---|---|
| SQuAD 2.0 | Wikipedia paragraphs with unanswerable questions | Reading Comprehension | 150K QA pairs | EM / F1 | 93.2 / 95.3 (Human: 86.8 / 89.5) |
| Natural Questions | Real Google search questions | Open Domain | 307K QA pairs | EM / F1 | 52.7 / 58.9 |
| TriviaQA | Trivia questions with evidence documents | Knowledge-Intensive | 95K QA pairs | EM / F1 | 73.3 / 77.5 |
| HotpotQA | Questions requiring reasoning over multiple docs | Multi-hop Reasoning | 113K QA pairs | EM / F1 | 72.5 / 84.8 |
| QuALITY | Questions about full-length articles and stories | Long Document | 6.7K QA pairs | Accuracy | 62.3% |
Exact Match (EM): binary - 1 if the prediction exactly matches any ground-truth answer, 0 otherwise. Strict but clear - the answer must be character-for-character identical.
F1: token-level overlap between prediction and ground truth, computed as the harmonic mean of precision and recall. More forgiving than EM for partial matches.
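Both metrics are easy to compute by hand. A simplified sketch (the official SQuAD scorer also strips punctuation and the articles a/an/the; only case and whitespace are normalized here):

```python
# Sketch of the two standard QA metrics: exact match and token-level F1.
from collections import Counter

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def exact_match(prediction: str, truths: list[str]) -> int:
    # 1 if the normalized prediction matches any ground-truth answer
    return int(any(normalize(prediction) == normalize(t) for t in truths))

def f1_score(prediction: str, truth: str) -> float:
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)  # harmonic mean

print(exact_match("March 14, 1879", ["March 14, 1879"]))                # 1
print(round(f1_score("born on March 14, 1879", "March 14, 1879"), 2))   # 0.75
```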
Code Examples
From quick extractive QA to production RAG pipelines.
```python
from transformers import pipeline

# Load a pre-trained extractive QA model
qa_pipeline = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",
    device=0,  # GPU, use -1 for CPU
)

# Your context passage
context = """
Albert Einstein was born on March 14, 1879, in Ulm, Germany.
He developed the theory of special relativity in 1905 while
working as a patent clerk in Bern, Switzerland.
"""

# Ask a question
result = qa_pipeline(
    question="When was Einstein born?",
    context=context,
)

print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.3f}")
print(f"Start: {result['start']}, End: {result['end']}")

# Output:
# Answer: March 14, 1879
# Confidence: 0.987
# Start: 28, End: 42
```
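At the "production RAG pipeline" end of the spectrum, here is a compact retrieve-then-generate sketch. It assumes the `sentence-transformers` package for dense retrieval and a Flan-T5 checkpoint from the models table for generation; the passages, prompt format, and model choices are illustrative and can be swapped for any retriever/generator pair.

```python
# Sketch of a minimal RAG pipeline: embed passages, retrieve the closest ones
# for the question, then generate an answer grounded in the retrieved context.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

retriever = SentenceTransformer("all-MiniLM-L6-v2")
generator = pipeline("text2text-generation", model="google/flan-t5-base")

passages = [
    "Albert Einstein was born on March 14, 1879, in Ulm, Germany.",
    "Einstein developed special relativity in 1905 in Bern, Switzerland.",
    "Marie Curie won Nobel Prizes in both Physics and Chemistry.",
]
passage_embeddings = retriever.encode(passages, convert_to_tensor=True)

def answer(question: str, top_k: int = 2) -> str:
    # Retrieve: rank passages by cosine similarity to the question
    query_embedding = retriever.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, passage_embeddings)[0]
    top = scores.topk(top_k)
    context = "\n".join(passages[i] for i in top.indices.tolist())
    # Generate: answer only from the retrieved context
    prompt = f"Answer the question using the context.\ncontext: {context}\nquestion: {question}"
    return generator(prompt, max_new_tokens=32)[0]["generated_text"]

print(answer("Where was Einstein born?"))
```

The key property is that the generator only sees retrieved text, which keeps answers citable and lets you update the knowledge base without retraining.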
Quick Reference
Extractive QA
- BERT, RoBERTa, ALBERT
- Fast, grounded, no hallucination
- Limited to verbatim text
Generative QA
- T5, GPT-4, Llama
- Natural answers, reasoning
- Watch for hallucination
RAG
- Retrieve then generate
- Easy to update knowledge
- Citable sources
Choosing an approach
- Context size determines approach
- Static vs dynamic knowledge
- Speed vs accuracy trade-off
Key takeaways
1. Extractive QA is fast and grounded but cannot synthesize or rephrase.
2. Generative QA is flexible but requires hallucination mitigation.
3. Context window determines whether you need RAG or can use long-context LLMs.
4. Start with RAG for most use cases - it is easier to update and maintain.
Use Cases
- ✓ Customer support bots
- ✓ Knowledge base search
- ✓ Reading comprehension
- ✓ FAQ automation
Architectural Patterns
Extractive QA
Find answer spans within provided context.
- + Grounded in source
- + Fast
- + No hallucination
- - Needs context provided
- - Can't synthesize
Generative QA
Generate answers using LLMs with retrieved context.
- + Fluent answers
- + Can synthesize
- + Handles complex questions
- - May hallucinate
- - Slower
- - Needs good retrieval
Closed-Book QA
Answer from the model's parametric knowledge, with no retrieved context (a minimal sketch follows this list).
- + No retrieval needed
- + Simple pipeline
- - Hallucination risk
- - Knowledge cutoff
- - Can't cite sources
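The closed-book pattern is a single call with no context passage; a minimal sketch using the instruction-tuned Flan-T5 family from the models table (the checkpoint choice is illustrative):

```python
# Closed-book sketch: no passage is provided, so the answer comes entirely
# from what the model memorized during pretraining (unverifiable, cutoff-bound).
from transformers import pipeline

closed_book = pipeline("text2text-generation", model="google/flan-t5-large")
result = closed_book("Answer the question: When was Albert Einstein born?")
print(result[0]["generated_text"])
```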
Implementations
API Services
- Perplexity API (Perplexity): Real-time search + generation. Cites sources.
- You.com API (You.com): Search-augmented answers. Good for current events.
Open Source
- DPR (Dense Passage Retrieval), Apache 2.0: Retrieval for open-domain QA. Use with a reader model (sketch below).
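DPR ships as paired question and passage encoders in Hugging Face `transformers`. A minimal retrieval-only sketch; the reader model that extracts the final answer span is omitted, and the passages are illustrative:

```python
# Sketch of DPR retrieval: encode questions and passages into the same vector
# space, then rank passages by dot-product similarity.
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

passages = [
    "Albert Einstein was born on March 14, 1879, in Ulm, Germany.",
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
]

with torch.no_grad():
    ctx_embeddings = ctx_encoder(
        **ctx_tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
    ).pooler_output
    question_embedding = q_encoder(
        **q_tokenizer("When was Einstein born?", return_tensors="pt")
    ).pooler_output

# Dot-product scores; the top-scoring passage would go to a reader model
scores = question_embedding @ ctx_embeddings.T
print(passages[int(scores.argmax())])
```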
Quick Facts
- Input: Text
- Output: Text
- Implementations: 3 open source, 2 API
- Patterns: 3 approaches