HotpotQA
113K question-answer pairs requiring reasoning over multiple Wikipedia documents.
Benchmark Stats
Models: 2
Papers: 2
Metrics: 1
f1
Higher is better
| Rank | Model | Code | Score | Paper / Source |
|---|---|---|---|---|
| 1 | gpt-4o | - | 71.3 | arXiv Paper |
| 2 | claude-35-sonnet | - | 68.5 | arXiv Paper |
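
This page does not say how the f1 score is computed. As a rough illustration only, the sketch below assumes the SQuAD-style token-overlap F1 that is commonly used for answer strings in extractive question answering. The function names `normalize_answer` and `token_f1` are hypothetical, and the normalization steps (lowercasing, stripping punctuation and English articles) are assumptions, not something confirmed by the leaderboard.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumed normalization; the leaderboard does not document its own.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer (0.0 to 1.0)."""
    pred_tokens = normalize_answer(prediction)
    gold_tokens = normalize_answer(gold)

    # If either side is empty after normalization, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative strings only; these are not HotpotQA examples.
    print(token_f1("the Eiffel Tower", "Eiffel Tower"))
    print(token_f1("Barack Obama", "Obama"))
```

Under this assumption, a table score such as 71.3 would correspond to the mean per-question F1 multiplied by 100. Whether the reported numbers cover answers only or also supporting facts is not stated here.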