HotpotQA


113K question-answer pairs requiring reasoning over multiple Wikipedia documents.

Benchmark Stats

Models: 2
Papers: 2
Metrics: 1

SOTA History

Not enough data to show trend.

Only 2 models on this benchmark


f1


Higher is better
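This page does not say how f1 is computed. As a rough illustration, the sketch below assumes the token-overlap F1 commonly used for extractive question answering: both strings are normalized (lowercased, punctuation and English articles removed), and F1 is the harmonic mean of token precision and recall. The function names `normalize_answer` and `token_f1` are hypothetical, and official evaluation scripts may add special cases (for example for yes/no answers) that this sketch omits.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer (assumed definition)."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this assumed definition, a benchmark-level score would be the mean of `token_f1` over all questions, scaled to 0-100 to match the scale of the leaderboard scores.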

Multi-hop question answering requiring reasoning over Wikipedia.

Rank  Model             Source     Score  Year  Paper
1     gpt-4o            Editorial  71.3   2025  Source
2     claude-35-sonnet  Editorial  68.5   2025  Source
