HellaSwag
70K sentence completion problems testing commonsense natural language inference.
Benchmark Stats
Models: 4
Papers: 4
Metrics: 1
accuracy
Higher is better
| Rank | Model | Code | Accuracy (%) | Paper / Source |
|---|---|---|---|---|
| 1 | gpt-4o | - | 95.3 | openai-blog |
| 2 | gemini-15-pro | - | 92.5 | google-blog |
| 3 | claude-35-sonnet | - | 89 | anthropic-blog |
| 4 | llama-3-70b | HF | 88 | meta-blog |

Note: HellaSwag tests commonsense NLI. Reported human performance is 95.6%; the top score listed here (95.3) is just below that figure.
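
For readers unfamiliar with how an accuracy score like those above is obtained, here is a minimal sketch in Python. It assumes each problem offers four candidate endings and that a model assigns one score per ending. The function names `pick_ending` and `accuracy` are hypothetical, and the numbers are toy values, not real model outputs. Actual evaluation setups (few-shot prompting, length-normalized log-likelihoods, and so on) differ between the sources listed and are not reproduced here.

```python
from typing import Sequence


def pick_ending(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring candidate ending."""
    if not scores:
        raise ValueError("scores must not be empty")
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return the percentage of problems where the prediction matches the gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one problem is required")
    correct = sum(p == g for p, g in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Toy data (made up for illustration): four problems, four endings each.
toy_scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.3, 0.2, 0.4, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.1, 0.5],
]
toy_labels = [1, 2, 0, 0]  # index of the correct ending for each problem

toy_predictions = [pick_ending(s) for s in toy_scores]
print(accuracy(toy_predictions, toy_labels))  # 75.0
```

In this toy run three of the four predictions match the gold labels, so the reported accuracy is 75.0. Since higher is better, a model ranks above another when it picks the correct ending more often.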