Evaluating language models on understanding Polish text: sentiment, implicatures, phraseology, tricky questions, and hallucination resistance.
The benchmark evaluates LLMs on understanding Polish text across four dimensions: sentiment analysis, language understanding (implicatures, author intent), phraseology (idioms, phraseological compounds), and tricky questions (logic, ambiguity, hallucination resistance). Each category is scored from 0 to 5. The dataset contains 378 hand-written examples and was created by SpeakLeash/Spichlerz.
Leading models on CPTU-Bench, ordered by tricky-questions score (0 to 5).
| # | Model | tricky-questions | Year | Source |
|---|---|---|---|---|
| ★ | Qwen/Qwen3.5-35B-A3B thinking (API)✓ | 4.70 | 2025 | paper ↗ |
| 2 | Qwen/Qwen3.5-27B thinking (API)✓ | 4.61 | 2025 | paper ↗ |
| 3 | gemini-2.0-flash-001✓ | 4.52 | 2025 | paper ↗ |
| 4 | deepseek-ai/DeepSeek-R1 (API)✓ | 4.49 | 2025 | paper ↗ |
| 5 | deepseek-ai/DeepSeek-V3.2 (API)✓ | 4.46 | 2025 | paper ↗ |
| 6 | Qwen/Qwen3.5-27B non-thinking (API)✓ | 4.43 | 2025 | paper ↗ |
| 7 | deepseek-ai/DeepSeek-V3.1 (API)✓ | 4.42 | 2025 | paper ↗ |
| 8 | Qwen/Qwen3.5-27B thinking (API)✓ | 4.42 | 2025 | paper ↗ |
| 9 | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API)✓ | 4.39 | 2025 | paper ↗ |
| 10 | moonshotai/Kimi-K2-Instruct-0905 (API)✓ | 4.39 | 2025 | paper ↗ |
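The ordering in the table above can be reproduced with a short sketch. It copies a handful of rows from the table (not the full leaderboard), checks that each score falls in the stated 0 to 5 range, and sorts from highest to lowest. The `Entry` type and the `rank_models` helper are hypothetical names introduced for illustration; they are not part of any CPTU-Bench tooling, and giving tied scores consecutive ranks in input order is an assumption that happens to match the table.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    model: str
    score: float  # tricky-questions score on the 0-5 scale


# A few rows copied from the table above, deliberately out of order.
ROWS = [
    Entry("deepseek-ai/DeepSeek-R1 (API)", 4.49),
    Entry("meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API)", 4.39),
    Entry("Qwen/Qwen3.5-35B-A3B thinking (API)", 4.70),
    Entry("gemini-2.0-flash-001", 4.52),
    Entry("moonshotai/Kimi-K2-Instruct-0905 (API)", 4.39),
]


def rank_models(rows: List[Entry]) -> List[Tuple[int, Entry]]:
    """Hypothetical helper: return (rank, entry) pairs, best score first.

    Tied scores keep their input order and receive consecutive ranks.
    """
    for entry in rows:
        # Scores outside the benchmark's 0-5 range indicate a data error.
        if not 0.0 <= entry.score <= 5.0:
            raise ValueError(f"score out of range for {entry.model}: {entry.score}")
    # sorted() is stable, also with reverse=True, so ties keep input order.
    ordered = sorted(rows, key=lambda e: e.score, reverse=True)
    return list(enumerate(ordered, start=1))


if __name__ == "__main__":
    for rank, entry in rank_models(ROWS):
        print(f"{rank:>2}  {entry.score:.2f}  {entry.model}")
```

A different tie policy (for example, shared ranks for equal scores) would only change how the rank numbers are assigned, not the sort itself.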