hellaswag
Unknown
OCR benchmark
Total Results: 4
Models Tested: 4
Metrics: 1
Last Updated: 2025-12-19
Metric: accuracy (higher is better)
Commonsense NLI. Models now exceed human performance (95.6%).

| Rank | Model | Score (accuracy, %) | Source |
|---|---|---|---|
| 1 | gpt-4o | 95.3 | openai-blog |
| 2 | gemini-15-pro | 92.5 | google-blog |
| 3 | claude-35-sonnet | 89 | anthropic-blog |
| 4 | llama-3-70b | 88 | meta-blog |