HellaSwag

70K sentence completion problems testing commonsense natural language inference.

Benchmark Stats

Models: 4
Papers: 4
Metrics: 1

SOTA History

Not enough data to show trend.

Only 4 models on this benchmark

accuracy

Higher is better
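
The accuracy scores in this leaderboard are percentages. As a rough illustration, assuming the usual HellaSwag setup in which each problem offers four candidate endings and exactly one is correct, the metric could be computed as sketched below. The function name and the toy data are hypothetical and are not part of any leaderboard tooling.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return the percentage of items whose predicted ending index matches the gold index."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count the items where the chosen ending equals the correct ending.
    correct = sum(p == g for p, g in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Hypothetical toy data: index (0-3) of the chosen ending for five problems.
toy_predictions = [0, 2, 1, 3, 3]
toy_labels = [0, 2, 1, 3, 0]
print(accuracy(toy_predictions, toy_labels))  # 80.0
```

Under this definition, a model that guesses uniformly at random among four endings would be expected to score about 25.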

Rank  Model             Source     Score  Year  Paper
1     gpt-4o            Editorial  95.3   2025  Source
2     gemini-15-pro     Editorial  92.5   2025  Source
3     claude-35-sonnet  Editorial  89     2025  Source
4     llama-3-70b       Editorial  88     2025  Source

Note (gpt-4o): Commonsense NLI. Top models now approach human performance (95.6%); the best score listed here is 95.3.

HellaSwag Leaderboard | CodeSOTA