Natural Language Processing
Processing and understanding text? Evaluate your models on benchmarks covering language understanding, generation, translation, and information extraction.
NLP in 2025 has matured from research systems into production infrastructure. Frontier models approach 90% on MMLU, and alignment techniques like Constitutional AI have sharply cut harmful outputs, though neither alignment nor hallucination is a solved problem. The field now balances raw capability with practical deployment concerns.
State of the Field (2025)
- Frontier models (GPT-5.1, Claude 3.5, DeepSeek-V3) achieve 88-92% on MMLU with context windows expanding to 2M tokens, though long-context reasoning still degrades with scale
- Open-source models now rival proprietary performance - DeepSeek-V3 (88.5 MMLU) matches GPT-4o, while Llama 4 offers a 30x cost reduction with respectable capability
- RAG adopted in 78% of production systems as standard architecture, while agentic AI enables multi-step autonomous task completion with 30% of orgs exploring deployment
- Constitutional AI reduced harmful outputs by 85% vs 2023, but hallucination remains critical challenge requiring explicit mitigation strategies in production
Quick Recommendations
General-purpose tasks requiring frontier capability
GPT-5.1 or Claude 3.5 Sonnet
Most consistent performance across benchmarks (92% MMLU for GPT-5.1, 88.9% for Claude). Battle-tested production infrastructure and SLAs justify premium cost for critical applications.
Cost-sensitive deployments with data privacy needs
DeepSeek-V3 or Llama 4 Scout
DeepSeek-V3 matches GPT-4o performance (88.5 MMLU) with local deployment. Llama 4 costs 30x less ($0.1/1M tokens) with 86% MMLU - exceptional value for budget constraints.
Machine translation for production
Claude 3.5 Sonnet for general, DeepL for critical content
Claude earned "good" ratings from professional translators 78% of the time, the highest among LLMs. DeepL's hybrid approach requires 2-3x fewer corrections for publication-ready translation, despite narrower language coverage.
Asian language processing and multimodal tasks
Qwen 3 (72B) or Qwen3-VL (235B)
Maintains 95% terminology accuracy on Asian technical content. Qwen3-VL rivals GPT-4V on vision benchmarks, with 20x visual token compression at 97% OCR accuracy.
Edge deployment and resource-constrained environments
DistilBERT, Mistral-7B, or quantized Llama 3
DistilBERT retains 97% GLUE performance at 40% parameter reduction. Quantized Llama 3 (2-4 bit) runs on mobile devices while handling straightforward QA and dialogue effectively.
Enterprise knowledge management with proprietary data
RAG architecture with Claude/GPT-5 + semantic search
78% of production systems use RAG for good reason - it enables knowledge updates without retraining, mitigates hallucination by grounding answers in retrieved text, and keeps proprietary data private.
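The core RAG loop is simple: embed the corpus, rank passages against the query, and paste the top hits into the prompt as grounding. A minimal sketch, using toy bag-of-words vectors as a stand-in for a real embedding model (production systems would use a semantic encoder and a vector store):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" - a stand-in for a real semantic
    # embedding model (e.g. an E5- or GTE-family encoder).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank the corpus by similarity to the query and keep the top-k
    # passages; these become the grounding context in the LLM prompt.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "The refund policy allows returns within 30 days.",
    "Our office is closed on public holidays.",
    "Refunds are issued to the original payment method.",
]
context = retrieve("how do refunds work", docs)
prompt = "Answer using only this context:\n" + "\n".join(context)
```

Swapping `embed` for a real model and `docs` for a vector database gives the standard production architecture; the control flow stays the same.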
Complex reasoning requiring step-by-step verification
Test-time compute scaling with o1 or similar
Allocating extra compute at inference time lets smaller models outperform models 14x their size on complex problems, with a 4x efficiency gain from compute-optimal allocation strategies.
Multi-step autonomous task execution
Agentic frameworks with GPT-5 or Claude as orchestrator
Agentic AI handles complex workflows requiring planning, execution, and iteration. Capital One reports a 5x latency reduction, and Salesforce has closed 18k deals since its October 2024 launch.
Tasks & Benchmarks
Polish LLM General
General-purpose evaluation of language models on Polish language tasks: sentiment, reading comprehension, question answering, cyberbullying detection, and emotional intelligence.
Polish Cultural Competency
Evaluating language models on Polish linguistic and cultural knowledge across art & entertainment, culture & tradition, geography, grammar, history, and vocabulary.
Polish Text Understanding
Evaluating language models on understanding Polish text: sentiment, implicatures, phraseology, tricky questions, and hallucination resistance.
Polish Conversation Quality
Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing.
Polish Emotional Intelligence
Evaluating language models on emotional intelligence in Polish: understanding emotional states, predicting emotional responses, and nuanced sentiment analysis.
Text Summarization
Text summarization compresses documents while preserving key information — a task that became dramatically more capable with LLMs but also harder to evaluate. PEGASUS (2020) and BART set the encoder-decoder baseline, but GPT-4 and Claude produce summaries that human evaluators often prefer over reference summaries, breaking ROUGE as a meaningful metric. CNN/DailyMail and XSum remain standard benchmarks, but the field is moving toward long-document summarization (books, legal filings, earnings calls) where 100K+ token context windows are finally making single-pass summarization feasible. The core unsolved problem is faithfulness — even frontier models hallucinate facts in roughly 5-15% of summaries, making factual consistency the critical metric that separates production-ready from demo-ready.
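Since faithfulness is the metric that matters, even a crude automated check helps triage summaries before human review. A sketch of a naive proxy - the fraction of summary content words absent from the source - with made-up example strings (production systems use NLI- or QA-based consistency checkers instead):

```python
def novel_word_rate(source: str, summary: str) -> float:
    # Crude faithfulness proxy: fraction of summary content words that
    # never appear in the source. High values flag possible hallucinated
    # facts; real checkers use NLI entailment or QA-based verification.
    src = set(source.lower().split())
    words = [w for w in summary.lower().split() if len(w) > 3]
    return sum(w not in src for w in words) / max(len(words), 1)

source = "the company reported revenue of 2 billion dollars in 2024"
faithful = novel_word_rate(source, "revenue reached 2 billion dollars")
risky = novel_word_rate(source, "profits collapsed after the merger")
```

Abstractive summaries legitimately paraphrase, so this proxy over-flags; its value is as a cheap first-pass filter, not a verdict.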
Text Classification
Text classification is the gateway drug of NLP — sentiment analysis, spam detection, topic labeling — and the task where transformers first proved their dominance over LSTMs. BERT (2018) set the template, but the real revolution came when instruction-tuned LLMs like GPT-4 and Llama 3 started matching fine-tuned classifiers zero-shot, threatening to make task-specific training obsolete. SST-2, AG News, and IMDB remain standard benchmarks, though the field increasingly cares about multilingual and low-resource performance where English-centric models still stumble. The open question: does a 70B parameter model doing classification via prompting actually beat a 100M fine-tuned encoder when you factor in latency and cost?
Question Answering
Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (2016) to the adversarial complexity of Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT in 2020, effectively saturating the benchmark — but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.
Natural Language Inference
Determining whether a premise sentence entails, contradicts, or is neutral toward a hypothesis - the three-way task defined by SNLI and MNLI. Beyond benchmarking, NLI models are widely reused for zero-shot classification and for factual-consistency checking of generated text.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Named Entity Recognition
Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from unstructured text, making it foundational to knowledge graphs, financial compliance, and clinical NLP. CoNLL-2003 English F1 scores have been above 93% since BERT, and current leaders like UniNER and GLiNER push past 95%, but these numbers mask the real difficulty: nested entities, emerging entity types, and cross-lingual transfer where performance drops 10-20 points. The shift from sequence labeling to generative NER (framing extraction as text generation) has opened the door for LLMs to compete, though latency-sensitive production systems still rely on encoder models like DeBERTa-v3 and SpanBERT.
Feature Extraction
Feature extraction — generating dense vector embeddings from text — is the unsung infrastructure layer powering semantic search, RAG pipelines, clustering, and recommendation systems. Sentence-BERT (2019) made it practical, but the field exploded in 2023-2024 with instruction-tuned embedding models like E5-Mistral, GTE-Qwen2, and Nomic Embed that turned decoder-only LLMs into embedding engines, pushing MTEB scores past 70 average across 50+ tasks. The key insight was that pre-training scale transfers to embedding quality — a 7B parameter embedding model crushes a 110M one on zero-shot retrieval. Matryoshka representation learning (Kusupati et al., 2022) added the ability to truncate embeddings to any dimension without retraining, making deployment flexible across latency and storage budgets.
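Matryoshka-trained embeddings can be truncated at serving time: keep the first `dim` coordinates and L2-renormalize so cosine similarity still behaves. A minimal sketch with a made-up 8-dimensional vector (this only works well if the model was trained with a Matryoshka-style objective):

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    # Matryoshka-style deployment trick: keep the leading `dim`
    # coordinates, then L2-renormalize so dot products remain cosines.
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

full = [0.5, -0.3, 0.8, 0.1, -0.2, 0.4, 0.0, 0.6]
short = truncate_embedding(full, 4)
```

This lets one index serve multiple latency/storage budgets - store full vectors, query with truncated ones where speed matters.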
Machine Translation
Machine translation is the oldest AI grand challenge, from rule-based systems in the 1950s to the transformer revolution sparked by "Attention Is All You Need" (2017) — literally the architecture that now powers all of AI. Google's multilingual T5 and Meta's NLLB-200 pushed translation to 200+ languages, but the real disruption came from GPT-4 and Claude matching or beating specialized MT systems on WMT benchmarks for high-resource pairs like English-German and English-Chinese. The unsolved frontier is low-resource languages (under 1M parallel sentences), where dedicated models like NLLB still dominate, and literary translation where preserving style, humor, and cultural nuance remains beyond any system. BLEU scores are increasingly seen as unreliable — human evaluation and newer metrics like COMET and BLEURT are becoming the standard.
Semantic Textual Similarity
Semantic similarity measures how close two pieces of text are in meaning — the foundation of duplicate detection, paraphrase mining, and retrieval. STS Benchmark scores climbed from 70 (GloVe averages) to 86+ with Sentence-BERT, and now exceed 92 with models like GTE-Qwen2 and E5-Mistral that leverage billion-parameter backbones. The real shift was from symmetric similarity (are these two sentences paraphrases?) to asymmetric retrieval (does this passage answer this query?), driven by the RAG revolution that made embedding quality a production-critical metric. Cross-lingual semantic similarity remains a hard frontier — models trained primarily on English still lose 5-10 points when comparing sentences across language families, despite multilingual pretraining.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
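The tool-augmented pattern works by having the model emit code that the host executes over the table, rather than asking it to read cells directly. A sketch with a made-up table and a hard-coded "generated" snippet standing in for the model's output (real systems sandbox the execution):

```python
# A tiny in-memory table; column names are invented for illustration.
table = [
    {"quarter": "Q1", "revenue": 1200},
    {"quarter": "Q2", "revenue": 1350},
    {"quarter": "Q3", "revenue": 1500},
    {"quarter": "Q4", "revenue": 1425},
]

# In a tool-augmented setup the LLM would generate a string like this
# in response to "what was Q3 revenue?"; the host runs it instead of
# trusting the model to read hundreds of rows in-context.
generated_code = (
    "answer = next(r['revenue'] for r in table if r['quarter'] == 'Q3')"
)

scope = {"table": table}
exec(generated_code, scope)  # sandboxing omitted in this sketch
answer = scope["answer"]
```

The code path sidesteps the failure mode of pure neural approaches - arithmetic and lookup errors that grow with table size.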
Fill-Mask
Fill-mask (masked language modeling) is the original BERT pretraining objective: mask 15% of tokens, predict what goes there. It powered the encoder revolution that dominated NLP from 2018 to 2022 and remains the training signal behind models like RoBERTa, DeBERTa, and XLM-RoBERTa that still run most production classification and NER systems. As a standalone task it has limited direct applications, but probing what a model predicts for masked slots became a key technique for analyzing bias, factual knowledge, and linguistic competence stored in model weights. The task has faded from the research spotlight as decoder-only (GPT-style) pretraining proved more scalable, but encoder models trained with MLM remain the most cost-efficient option for tasks that need fast inference on structured prediction.
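The masking procedure itself is a few lines: pick ~15% of positions, record the originals as targets, replace them with a mask token. A simplified sketch (the original BERT recipe's 80/10/10 split between `[MASK]`, random token, and unchanged token is omitted for brevity):

```python
import random

def mask_tokens(tokens: list[str], rate: float = 0.15, seed: int = 0):
    # BERT-style MLM data prep: hide ~`rate` of tokens and keep the
    # originals as the prediction targets for those positions.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            targets[i] = tok
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

tokens = "the capital of france is paris".split()
masked, targets = mask_tokens(tokens)
```

The model is then trained to predict `targets[i]` at each masked position; probing those predictions is exactly the bias-and-knowledge analysis technique the paragraph above describes.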
Zero-Shot Classification
Zero-shot classification asks a model to categorize text into labels it has never been explicitly trained on — the ultimate test of language understanding and generalization. The breakthrough was the natural language inference (NLI) trick: reframe classification as "does this text entail the label?" using models fine-tuned on MNLI, pioneered by Yin et al. (2019) and popularized by BART-large-MNLI. Today, instruction-tuned LLMs have largely subsumed this approach — GPT-4, Claude, and Llama 3 can classify into arbitrary taxonomies via prompting with near-supervised accuracy. The remaining challenge is consistency and calibration: LLMs are powerful but their predictions can be brittle to prompt phrasing, making them unreliable for high-stakes automated pipelines without careful engineering.
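The NLI trick has a simple structure: each candidate label becomes a hypothesis ("This text is about X.") and the best-entailed one wins. A sketch of that structure, with a toy word-overlap scorer standing in for the MNLI cross-encoder a real pipeline would call:

```python
import string

def _tokens(s: str) -> list[str]:
    table = str.maketrans("", "", string.punctuation)
    return s.lower().translate(table).split()

def entailment_score(premise: str, hypothesis: str) -> float:
    # Toy stand-in for an NLI model's entailment probability, using
    # word overlap. A real pipeline would score each (text, hypothesis)
    # pair with an MNLI-fine-tuned model such as BART-large-MNLI.
    p = set(_tokens(premise))
    h = _tokens(hypothesis)
    return sum(w in p for w in h) / max(len(h), 1)

def zero_shot_classify(text: str, labels: list[str]) -> str:
    # The NLI reframing: classification becomes an entailment query
    # per label, so no label-specific training data is needed.
    return max(
        labels,
        key=lambda lab: entailment_score(text, f"This text is about {lab}."),
    )

label = zero_shot_classify(
    "The sports desk reported that the striker scored twice.",
    ["sports", "politics", "cooking"],
)
```

Note how the label set is just an argument - that is the whole appeal, and also the source of the calibration brittleness the paragraph warns about.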
Reading Comprehension
Understanding a given passage well enough to answer questions about it, as tested by benchmarks like RACE and SQuAD - closely related to question answering, but focused on comprehension of the provided text rather than open-domain retrieval.
Language Modeling
Language modeling — predicting the next token — is the pretraining objective that accidentally became the foundation of modern AI. From GPT-2's "too dangerous to release" moment in 2019 to GPT-4, Claude, Llama 3, and Gemini, scaling language models has produced emergent capabilities no one predicted from loss curves alone. Perplexity on benchmarks like WikiText-103 and Penn Treebank is essentially a historical artifact now; the field evaluates via downstream tasks (MMLU, HumanEval, MATH) because raw perplexity stopped correlating with usefulness years ago. The frontier has moved to mixture-of-experts architectures (Mixtral, DeepSeek-V3), longer context windows (1M+ tokens), and efficient inference — the model is no longer the bottleneck, serving it is.
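Perplexity, the classic LM metric mentioned above, is just the exponentiated average negative log-likelihood of the observed tokens. A minimal sketch with hypothetical per-token log-probabilities (a real evaluation would pull these from the model's forward pass):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    # PPL = exp(-mean(log p)): the geometric-mean inverse probability
    # the model assigns to the sequence, so lower is better.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probabilities for a four-token sequence.
lps = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = perplexity(lps)  # geometric mean prob = (1/64)^(1/4), so PPL = sqrt(8)
```

A PPL of ~2.83 means the model was, on average, as uncertain as a uniform choice over ~2.83 tokens - which is why raw perplexity says little about downstream usefulness.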
Honest Takes
Benchmark scores are mostly contamination
Models perform substantially better on problems released before their training cutoff than after. GLUE and SuperGLUE are saturated. Focus on domain-specific evaluation on your actual production data, not leaderboard rankings.
Longer context doesn't mean better reasoning
Models now support 2M token windows but reasoning degrades even when retrieval succeeds. Don't assume more context improves performance - test carefully as adding documents can hurt accuracy.
RAG beats fine-tuning for most use cases
Unless you have domain-specific style requirements or truly unique knowledge patterns, RAG provides better ROI: knowledge updates without retraining, lower cost, and retrieval quality that matters more than model size.
Edge deployment is viable now
Quantization delivers up to 90% size reduction while maintaining 95%+ accuracy, putting capable models on mobile devices. Edge deployment grew 340% in 2025 - for many applications, privacy and latency benefits outweigh cloud convenience.
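The size reduction comes from storing each weight in fewer bits. A sketch of symmetric int8 quantization on a made-up weight list - one scale maps floats onto [-127, 127], a 4x cut versus float32 (the 2-4 bit schemes that reach 90% use finer-grained, per-group scales):

```python
def quantize_int8(weights: list[float]):
    # Symmetric 8-bit quantization: one scale maps the float range
    # onto [-127, 127]; each weight then fits in a single byte.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 0.07, -0.21]   # hypothetical weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)          # each entry within scale/2 of original
```

The per-weight error is bounded by half the scale, which is why accuracy holds up: errors are small, dense, and partially cancel across a layer.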
Specialized models still win on translation
While LLMs won 9 of 11 WMT24 language pairs, hybrid approaches like DeepL require 2-3x fewer editorial corrections. For production translation at scale, domain-specific models justify the added complexity.