HellaSwag

70K sentence completion problems testing commonsense natural language inference.

Benchmark Stats

Models: 4
Papers: 4
Metrics: 1

Metric: accuracy (higher is better)

Rank  Model             Code  Score  Paper / Source
1     gpt-4o            -     95.3   openai-blog
2     gemini-15-pro     -     92.5   google-blog
3     claude-35-sonnet  -     89     anthropic-blog
4     llama-3-70b       HF    88     meta-blog

Note on gpt-4o: Commonsense NLI. Models now exceed human performance (95.6%).
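
The accuracy scores listed here are percentages of problems answered correctly. As a rough illustration of how such a score can be computed for a multiple-choice completion task, here is a minimal sketch. It assumes each problem has four candidate endings (as in the HellaSwag dataset) and that a model assigns one score per ending. The function name `multiple_choice_accuracy` and the sample scores are hypothetical, and the listed models may have been evaluated under different protocols.

```python
from typing import Sequence


def multiple_choice_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> float:
    """Return the percentage of items whose highest-scoring ending is the labelled one.

    scores[i][j] is the (hypothetical) model score for ending j of item i.
    labels[i] is the index of the correct ending for item i.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one item")

    correct = 0
    for item_scores, label in zip(scores, labels):
        # Pick the index of the ending with the highest score.
        predicted = max(range(len(item_scores)), key=lambda i: item_scores[i])
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


# Made-up scores for four items, each with four candidate endings.
example_scores = [
    [0.1, 0.7, 0.1, 0.1],  # predicted ending: 1
    [0.4, 0.2, 0.3, 0.1],  # predicted ending: 0
    [0.2, 0.2, 0.5, 0.1],  # predicted ending: 2
    [0.1, 0.1, 0.2, 0.6],  # predicted ending: 3
]
example_labels = [1, 0, 3, 3]  # the third item is answered incorrectly

print(multiple_choice_accuracy(example_scores, example_labels))  # prints 75.0
```

With three of the four made-up items answered correctly, the sketch reports 75.0, which is on the same 0 to 100 scale as the Score column.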