HellaSwag

Name: HellaSwag Benchmark Results
Creator: Unknown
License: https://creativecommons.org/licenses/by/4.0/

70K sentence completion problems testing commonsense natural language inference.

Models: 4

Papers: 4

Metrics: 1

Higher is better

Commonsense NLI. Models are reported to exceed human performance (95.6%), although the top score listed here is 95.3.

Rank	Model	Source	Score	Year	Paper
1	gpt-4o	Editorial	95.3	2025	Source
2	gemini-15-pro	Editorial	92.5	2025	Source
3	claude-35-sonnet	Editorial	89	2025	Source
4	llama-3-70b	Editorial	88	2025	Source
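
As a rough check on the human-performance note, the listed scores can be compared against the 95.6% human figure quoted on this page. The following is a minimal Python sketch: the `scores` dictionary, `HUMAN_BASELINE`, and `gap_to_human` are illustrative names that simply transcribe the table, not part of any leaderboard API, and treating the scores as percent accuracy is an assumption (the page only says that higher is better).

```python
# Human figure quoted on this page; assumed to be on the same scale as the scores.
HUMAN_BASELINE = 95.6

# Scores transcribed by hand from the table above (hypothetical data structure).
scores = {
    "gpt-4o": 95.3,
    "gemini-15-pro": 92.5,
    "claude-35-sonnet": 89.0,
    "llama-3-70b": 88.0,
}


def gap_to_human(score, baseline=HUMAN_BASELINE):
    """Return score minus baseline, rounded to one decimal place."""
    return round(score - baseline, 1)


# Signed gap for every listed model; a positive value means above the human figure.
gaps = {model: gap_to_human(score) for model, score in scores.items()}
above_human = [model for model, gap in gaps.items() if gap > 0]

for model, gap in gaps.items():
    print(f"{model}: {gap:+.1f}")
print("Above human baseline:", above_human)
```

With the four scores listed here, `above_human` comes out empty, so any claim of exceeding human performance would have to rest on results not shown in this table.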