Codesota · Tasks · Polish Text Understanding

Polish Text Understanding.

Evaluating language models on understanding Polish text: sentiment, implicatures, phraseology, tricky questions, and hallucination resistance.

1 dataset · 465 results · canonical metric: average

§ 02 · Canonical benchmark

The reference dataset.

CPTU-Bench

Evaluates LLMs on understanding Polish text across 4 dimensions: sentiment analysis, language understanding (implicatures, author intent), phraseology (idioms, phraseological compounds), and tricky questions (logic, ambiguity, hallucination resistance). Score range 0-5 per category. Created by SpeakLeash/Spichlerz.

Primary metric: average
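
As a rough illustration of how a single "average" figure could be derived from the four per-category scores, here is a minimal sketch. It assumes the primary metric is the unweighted mean of the four categories, which the page does not state explicitly. The function name `cptu_average`, the dictionary keys, and the sample scores are all hypothetical and are not taken from the benchmark's own code or results.

```python
from statistics import mean

# The four dimensions named in the benchmark description.
# The key spellings are hypothetical, chosen for this sketch.
CATEGORIES = (
    "sentiment",
    "language_understanding",
    "phraseology",
    "tricky_questions",
)


def cptu_average(scores: dict[str, float]) -> float:
    """Return the unweighted mean of the four category scores.

    Assumption: the leaderboard's "average" is a plain mean; the
    source page does not confirm the exact aggregation.
    """
    missing = [c for c in CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    for category in CATEGORIES:
        # Each category is scored on a 0-5 scale.
        if not 0.0 <= scores[category] <= 5.0:
            raise ValueError(f"{category} score out of range 0-5")
    return mean(scores[c] for c in CATEGORIES)


# Made-up scores for illustration only, not real leaderboard data.
example = {
    "sentiment": 4.0,
    "language_understanding": 4.5,
    "phraseology": 3.5,
    "tricky_questions": 4.0,
}
print(cptu_average(example))  # 4.0
```

If the benchmark weights categories differently, only the final `mean` call would need to change.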

§ 03 · Top 10

Leading models.

Leading models on CPTU-Bench.

#	Model	tricky-questions	Year	Source
1	Qwen/Qwen3.5-35B-A3B thinking (API) ✓	4.70	2026	paper
2	Qwen/Qwen3.5-27B thinking (API) ✓	4.61	2026	paper
3	gemini-2.0-flash-001 ✓	4.52	2026	paper
4	deepseek-ai/DeepSeek-R1 (API) ✓	4.49	2026	paper
5	deepseek-ai/DeepSeek-V3.2 (API) ✓	4.46	2026	paper
6	Qwen/Qwen3.5-27B non-thinking (API) ✓	4.43	2026	paper
7	deepseek-ai/DeepSeek-V3.1 (API) ✓	4.42	2026	paper
8	Qwen/Qwen3.5-27B thinking (API) ✓	4.42	2026	paper
9	meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (API) ✓	4.39	2026	paper
10	moonshotai/Kimi-K2-Instruct-0905 (API) ✓	4.39	2026	paper

§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

CPTU-Bench

CANONICAL

465 results · average

Top: Qwen/Qwen3.5-35B-A3B thinking (API) — 4.70

§ 05 · Related tasks

Other tasks in Natural Language Processing.

Feature Extraction · Fill-Mask · Named Entity Recognition · Natural Language Inference · Polish Conversation Quality · Polish Cultural Competency · Polish Emotional Intelligence · Polish LLM General