5 Benchmarks Tracked

Polish LLM Benchmarks
Real Performance Data

Compare language models on Polish benchmarks: PLCC, CPTU, MT-Bench-PL, EQ-Bench-PL, and Open PL LLM Leaderboard. All data live from the database.

Published Jan 1, 2025 · Updated Mar 30, 2026

Polish SOTA

PLCC
PLCC
SOTA: Gemini-3.1-Pro-Preview · 97 (average)
CPTU-Bench
SOTA: Qwen/Qwen3.5-27B thinking (API) · 4.34 (average)
Open PL LLM
SOTA: Mistral-Large-Instruct-2411 · 69.84 (average)
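
Each SOTA entry above is just the top row of the corresponding leaderboard. As a minimal sketch of how that selection works (the flat record schema is an assumption, not the site's actual database layout; the model names and scores are taken from the leaderboards on this page), picking the best model per benchmark looks like:

```python
# Sketch: select the SOTA (highest-scoring) model per benchmark from flat
# (benchmark, model, score) records. Schema is assumed; values are sample
# rows copied from the leaderboards below.
rows = [
    ("PLCC", "Gemini-3.1-Pro-Preview", 97.0),
    ("PLCC", "Gemini-3.0-Pro-Preview", 95.83),
    ("CPTU-Bench", "Qwen/Qwen3.5-27B thinking (API)", 4.34),
    ("CPTU-Bench", "gemini-2.0-flash-001", 4.29),
    ("Open PL LLM", "Mistral-Large-Instruct-2411", 69.84),
    ("Open PL LLM", "Meta-Llama-3.1-405B-Instruct-FP8", 69.44),
]

def sota_per_benchmark(rows):
    """Return {benchmark: (model, score)} keeping the highest score per benchmark."""
    best = {}
    for bench, model, score in rows:
        if bench not in best or score > best[bench][1]:
            best[bench] = (model, score)
    return best

print(sota_per_benchmark(rows)["PLCC"])  # ('Gemini-3.1-Pro-Preview', 97.0)
```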

Polish Benchmarks

Five benchmarks covering cultural knowledge, text understanding, general evaluation, conversation quality, and emotional intelligence in Polish.

Leaderboards

Top models on each Polish benchmark. Live from the database.

PLCC

Leaderboard — average · Full →

| # | Model | Score |
|---|-------|-------|
| 1 | Gemini-3.1-Pro-Preview | 97 |
| 2 | Gemini-3.0-Pro-Preview | 95.83 |
| 3 | GPT-5.4-2026-03-05 (high reasoning) | 92.17 |
| 4 | Gemini-2.5-Pro-Preview-06-05 | 92.17 |
| 5 | Gemini-3-Flash-Preview | 91.67 |
| 6 | GPT-5-Pro-2025-10-06 (high reasoning) | 91 |
| 7 | GPT-5.4-2026-03-05 (low reasoning) | 90.50 |
| 8 | Grok-4 | 90.50 |
| 9 | Gemini-2.5-Pro-Exp-03-25 | 89.50 |
| 10 | GPT-5-2025-08-07 | 89.50 |

CPTU-Bench

Leaderboard — average · Full →

| # | Model | Score |
|---|-------|-------|
| 1 | Qwen/Qwen3.5-27B thinking (API) | 4.34 |
| 2 | gemini-2.0-flash-001 | 4.29 |
| 3 | Qwen/Qwen3.5-27B non-thinking (API) | 4.27 |
| 4 | Qwen/Qwen3.5-35B-A3B thinking (API) | 4.22 |
| 5 | Qwen/Qwen3.5-35B-A3B non-thinking (API) | 4.18 |
| 6 | deepseek-ai/DeepSeek-V3.2 (API) | 4.14 |
| 7 | deepseek-ai/DeepSeek-R1 (API) | 4.14 |
| 8 | gemini-2.0-flash-lite-001 | 4.09 |
| 9 | 🚧 DeepSeek-V3-0324 | 4.03 |
| 10 | deepseek-ai/DeepSeek-V3.1 (API) | 4.03 |

Open PL LLM Leaderboard

Leaderboard — average · Full →

| # | Model | Score |
|---|-------|-------|
| 1 | Mistral-Large-Instruct-2411 | 69.84 |
| 2 | Meta-Llama-3.1-405B-Instruct-FP8 | 69.44 |
| 3 | Mistral-Large-Instruct-2407 | 69.11 |
| 4 | Qwen2.5-72B-Instruct | 67.92 |
| 5 | Qwen2.5-72B | 67.38 |
| 6 | QwQ-32B-Preview | 67.01 |
| 7 | Qwen2.5-32B | 66.73 |
| 8 | Llama-3.3-70B-Instruct | 66.40 |
| 9 | Qwen2-72B | 66.02 |
| 10 | remek/v3/rl-instruct/110k | 65.99 |

Polish MT-Bench

Leaderboard — pl-score · Full →

| # | Model | Score |
|---|-------|-------|
| 1 | gemma-3-27b-it | 9.28 |
| 2 | Mistral-Small-3.1-24B-Instruct-2503 | 9.18 |
| 3 | Phi-4 | 9.07 |
| 4 | gemma-3-12b-it | 8.97 |
| 5 | Qwen2.5-32B-Instruct | 8.86 |
| 6 | Qwen2-72B-Instruct | 8.78 |
| 7 | Mistral-Small-24B-Instruct-2501 | 8.72 |
| 8 | Mistral-Large-Instruct-2407 | 8.66 |
| 9 | Gemma-2-27b-it | 8.62 |
| 10 | aya-expanse-32b | 8.62 |

Polish EQ-Bench

Leaderboard — eq-score · Full →

| # | Model | Score |
|---|-------|-------|
| 1 | Mistral-Large-Instruct-2407 | 78.07 |
| 2 | Mistral-Large-Instruct-2411 | 77.29 |
| 3 | Meta-Llama-3.1-405B-Instruct-FP8 | 77.23 |
| 4 | GPT-4o-2024-08-06 | 75.15 |
| 5 | gpt-4-turbo-2024-04-09 | 74.59 |
| 6 | Bielik-11B-v2.6-Instruct | 73.70 |
| 7 | 🚧 DeepSeek-V3-0324 | 73.46 |
| 8 | Mistral-Small-Instruct-2409 | 72.85 |
| 9 | Llama-PLLuM-70B-chat | 72.56 |
| 10 | Meta-Llama-3.1-70B-Instruct | 72.53 |

Bielik Cross-Benchmark Tracker

All Bielik versions across 5 Polish benchmarks. Data live from the database.

| Model | PLCC | CPTU-Bench | Open PL LLM Leaderboard | Polish MT-Bench | Polish EQ-Bench |
|-------|------|------------|-------------------------|-----------------|-----------------|
| Bielik-0.1 | 46.67 | - | - | - | - |
| Bielik-1.5B-v1.0-DPO-001-L2 | - | - | 80.48 | - | - |
| Bielik-1.5B-v1.0-DPO-001-L3 | - | - | 16.38 | - | - |
| Bielik-1.5B-v1.0-DPO-001-L3-copy | - | - | 74.70 | - | - |
| Bielik-1.5B-v1.0-m3 | - | - | 16.84 | - | - |
| Bielik-1.5B-v1.0-m3b | - | - | 74 | - | - |
| Bielik-1.5B-v1.0-m4 | - | - | 74.50 | - | - |
| Bielik-1.5B-v3 | - | - | 35.78 | - | - |
| Bielik-1.5B-v3.0-Instruct | 27 | 2.38 | 41.36 | - | - |
| Bielik-1.5B-v3.0-Instruct-RC042025 | - | - | 20.50 | - | - |
| Bielik-1.5B-v3.0-Instruct-SFT-RC042025 | - | - | 71.11 | - | - |
| Bielik-11B-v2 | - | - | 34.37 | - | - |
| Bielik-11B-v2.0-Instruct | - | 2.20 | 50.18 | 7.56 | 68.24 |
| Bielik-11B-v2.1-Instruct | - | 3.92 | 83.64 | 9.50 | 60.07 |
| Bielik-11B-v2.2-Instruct | - | 3.72 | 83.80 | 9.35 | 69.05 |
| Bielik-11B-v2.2-M-1.2 | - | - | 92.32 | - | - |
| Bielik-11B-v2.3-Instruct | - | 3.97 | 86.29 | 8.97 | 70.86 |
| Bielik-11B-v2.3-Instruct-AWQ | - | - | 69.58 | - | - |
| Bielik-11B-v2.3-Instruct-GPTQ | - | - | 32.85 | - | - |
| Bielik-11B-v2.3-Instruct.IQ1_M.gguf.IQ | - | - | 26.85 | - | - |
| Bielik-11B-v2.3-Instruct.IQ2_XXS.gguf.IQ | - | - | 43.53 | - | - |
| Bielik-11B-v2.3-Instruct.IQ3_XXS.gguf.IQ | - | - | 84.35 | - | - |
| Bielik-11B-v2.3-Instruct.Q4_K_M.gguf | - | - | 86.57 | - | - |
| Bielik-11B-v2.3-Instruct.Q4_K_M.gguf.IQ | - | - | 91.13 | - | - |
| Bielik-11B-v2.3-Instruct.Q6_K.gguf | - | - | 86.57 | - | - |
| Bielik-11B-v2.3-Instruct.Q8_0.gguf | - | - | 51.82 | - | - |
| Bielik-11B-v2.4-Instruct-MS | - | - | 65.51 | - | - |
| Bielik-11B-v2.4-Instruct-SL | - | - | 65.87 | - | - |
| Bielik-11B-v2.4-Instruct-TI | - | - | 37.45 | - | - |
| Bielik-11B-v2.5-Instruct | - | 2.91 | 63.95 | - | 72.00 |
| Bielik-11B-v2.5-Instruct-D-GRPO_H_070 | - | - | 64.57 | - | - |
| Bielik-11B-v2.5-Instruct-GRPO_010 | - | - | 67.61 | - | - |
| Bielik-11B-v2.5-Instruct-GRPO_020 | - | - | 67.76 | - | - |
| Bielik-11B-v2.5-Instruct-GRPO_030 | - | - | 84.35 | - | - |
| Bielik-11B-v2.5-Instruct-GRPO_040 | - | - | 64.19 | - | - |
| Bielik-11B-v2.5-Instruct-GRPO_050 | - | - | 61.50 | - | - |
| Bielik-11B-v2.5-Instruct-GRPO_060 | - | - | 91.04 | - | - |
| Bielik-11B-v2.5-Instruct-GRPO_H_010 | - | - | 35.35 | - | - |
| Bielik-11B-v2.5-Instruct-GRPO_H_030 | - | - | 68.21 | - | - |
| Bielik-11B-v2.6-Instruct | - | 3.41 | 91.16 | - | 73.70 |
| Bielik-11B-v3-Base-20250730 | - | - | 77.56 | - | - |
| Bielik-11B-v3.0-Instruct | 78 | 3.73 | 69.48 | - | 71.20 |
| Bielik-11B-v3.0-Instruct-FP8-Dynamic | - | - | 80 | - | - |
| Bielik-11B-v3.0-Instruct.Q4_K_M.gguf | - | - | 88.44 | - | - |
| Bielik-11B-v3.0-Instruct.Q6_K.gguf | - | - | 88.44 | - | - |
| Bielik-11B-v3.0-Instruct.Q8_0.gguf | - | - | 65.82 | - | - |
| Bielik-2.1 | 61 | - | - | - | - |
| Bielik-2.2 | 62 | - | - | - | - |
| Bielik-2.3 | 62.17 | - | - | - | - |
| Bielik-2.5 | 75 | - | - | - | - |
| Bielik-2.6 | 72 | - | - | - | - |
| Bielik-4.5B-v3 | - | - | 87.08 | - | - |
| Bielik-4.5B-v3.0-Instruct | 35 | 2.46 | 56.13 | - | 53.58 |
| Bielik-4.5B-v3.0-Instruct-SFT-RC042025 | - | - | 54.84 | - | - |
| Bielik-7B-Instruct-v0.1 | - | 2.16 | 30.43 | 6.15 | 31.26 |
| Bielik-7B-Instruct-v0.1-GPTQ | - | - | 66.44 | - | - |
| Bielik-7B-v0.1 | - | - | 20.85 | - | - |
| Bielik-Minitron-7B-v3.0-Instruct | 64 | - | - | - | - |
| Bielik-PL-11B-v3.0-Instruct | - | - | 66.24 | - | - |
| Bielik-PL-Minitron-7B-v3.0-Instruct | - | - | 81.99 | - | - |
| Bielik-SOLAR-LIKE-10.7B-Instruct-v0.1 | - | - | 31.87 | - | 34.17 |
| minitron-Bielik-7B-v3.0-Instruct-GGUF.Q4_K_M.gguf | - | - | 42.09 | - | - |
| minitron-Bielik-7B-v3.0-Instruct-GGUF.Q6_K.gguf | - | - | 87.33 | - | - |
| minitron-Bielik-7B-v3.0-Instruct-GGUF.Q8_0.gguf | - | - | 44 | - | - |
| MSH-Lite-7B-v1-Bielik-v2.3-Instruct-Llama-Prune | - | - | 19.53 | - | - |
| MSH-v1-Bielik-v2.3-Instruct-MedIT-merge | - | - | 50.62 | - | - |
| speakleash/Bielik-Minitron-7B-v3.0-Instruct | - | 3.38 | - | - | - |
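
The quantized Bielik-11B-v2.3-Instruct entries in the tracker make the cost of aggressive quantization visible. A small sketch (scores copied from the tracker; the comparison logic is illustrative, not the site's own code) that computes each quant's Open PL LLM delta against the full-precision score:

```python
# Illustrative sketch: Open PL LLM Leaderboard scores for quantized
# Bielik-11B-v2.3-Instruct variants, copied from the tracker above.
FULL_PRECISION = 86.29  # Bielik-11B-v2.3-Instruct (unquantized)

quant_scores = {
    "Q4_K_M.gguf": 86.57,
    "Q6_K.gguf": 86.57,
    "IQ3_XXS.gguf": 84.35,
    "Q8_0.gguf": 51.82,
    "IQ2_XXS.gguf": 43.53,
    "IQ1_M.gguf": 26.85,
}

def deltas(scores, baseline):
    """Score change of each quant vs the full-precision baseline,
    sorted from least to most degradation."""
    return sorted(((name, round(s - baseline, 2)) for name, s in scores.items()),
                  key=lambda kv: kv[1], reverse=True)

for name, d in deltas(quant_scores, FULL_PRECISION):
    print(f"{name:>14} {d:+.2f}")
```

Note the anomaly the numbers expose: the 1- and 2-bit IQ quants lose 40+ points, while the Q8_0 score (51.82) sits far below Q4_K_M (86.57) despite being the less aggressive quantization, which suggests a measurement or configuration issue rather than a real quality ordering.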

About Bielik

Bielik (Polish for "White-tailed Eagle") is developed by SpeakLeash. The models are optimized specifically for Polish language tasks, use a custom APT4 tokenizer, and are trained on 292B+ tokens of Polish text. Apache 2.0 licensed.

PLLuM Cross-Benchmark Tracker

All PLLuM versions across 5 Polish benchmarks. Data live from the database.

| Model | PLCC | CPTU-Bench | Open PL LLM Leaderboard | Polish MT-Bench | Polish EQ-Bench |
|-------|------|------------|-------------------------|-----------------|-----------------|
| CYFRAGOVPL/Llama-PLLuM-8B-instruct | - | 3.46 | - | - | - |
| CYFRAGOVPL/PLLuM-12B-nc-chat | - | 2.62 | - | - | - |
| CYFRAGOVPL/PLLuM-12B-nc-instruct | - | 3.31 | - | - | - |
| CYFRAGOVPL/pllum-12b-nc-instruct-250715 | - | 3.29 | - | - | - |
| Llama-PLLuM-70B-chat | 46 | 3.61 | - | 4.80 | 72.56 |
| Llama-PLLuM-70B-chat-250801 | 62 | - | - | - | - |
| Llama-PLLuM-70B-instruct | - | 3.33 | - | - | 69.99 |
| Llama-PLLuM-8B-chat | 34 | 2.25 | - | 6.05 | 46.20 |
| PLLuM-12B-chat | 33 | 3.14 | - | 9.30 | 52.26 |
| PLLuM-12B-instruct | - | 3.09 | - | - | 36.21 |
| PLLuM-12B-nc-chat | 70 | - | - | 7.55 | - |
| pllum-12b-nc-chat-250715 | - | 3.46 | - | - | 55.17 |
| PLLuM-12B-nc-chat-250715 | 75 | - | - | - | - |
| PLLuM-8x7B-chat | 44 | 3.44 | - | 7.10 | 45.22 |
| PLLuM-8x7B-instruct | - | 3.01 | - | - | 39.55 |
| PLLuM-8x7B-nc-chat | 73 | 3.08 | - | 3.35 | 47.29 |
| PLLuM-8x7B-nc-instruct | - | 3.22 | - | - | 41.75 |
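
One pattern in the PLLuM tracker: on Polish EQ-Bench, each chat variant outscores its instruct sibling. A quick sketch (scores copied from the tracker; the chat/instruct pairing is illustrative and assumes the variants share a base model) quantifying that gap:

```python
# Sketch: chat-vs-instruct gap on Polish EQ-Bench for PLLuM variant pairs.
# Scores are copied from the tracker above; pairing by shared base model
# is an assumption made for illustration.
pairs = {
    "PLLuM-12B": {"chat": 52.26, "instruct": 36.21},
    "PLLuM-8x7B": {"chat": 45.22, "instruct": 39.55},
    "Llama-PLLuM-70B": {"chat": 72.56, "instruct": 69.99},
}

def chat_advantage(pairs):
    """EQ-Bench score gain of each chat variant over its instruct sibling."""
    return {base: round(v["chat"] - v["instruct"], 2) for base, v in pairs.items()}

print(chat_advantage(pairs))
```

The gap is largest for PLLuM-12B (over 16 points), consistent with chat tuning mattering most for the emotional-intelligence-style prompts EQ-Bench uses.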

About PLLuM

PLLuM (Polish Large Language Universal Model) is developed by OPI (National Information Processing Institute) as part of a government-backed initiative to build open Polish AI infrastructure. Models range from 8B to 70B parameters.

PLLuM Project Page →

Resources & Links

Explore More Benchmarks

See how Polish OCR models compare, or explore our broader LLM benchmark tracking.

Get notified when these results update

New models drop weekly. We track them so you don't have to.