Natural Language Processing
Processing and understanding text? Evaluate your models on benchmarks covering language understanding, generation, translation, and information extraction.
NLP in 2025 has matured from research systems into production infrastructure. Frontier models approach 90% on MMLU, and alignment techniques like Constitutional AI have sharply cut harmful outputs, though neither alignment nor hallucination is a solved problem. The field now balances raw capability with practical deployment concerns.
State of the Field (2025)
- Frontier models (GPT-5.1, Claude 3.5, DeepSeek-V3) achieve 88-92% on MMLU with context windows expanding to 2M tokens, though long-context reasoning still degrades with scale
- Open-source models now rival proprietary performance - DeepSeek-V3 (88.5 MMLU) matches GPT-4o, while Llama 4 offers a 30x cost reduction with respectable capability
- RAG adopted in 78% of production systems as standard architecture, while agentic AI enables multi-step autonomous task completion with 30% of orgs exploring deployment
- Constitutional AI reduced harmful outputs by 85% vs 2023, but hallucination remains critical challenge requiring explicit mitigation strategies in production
Quick Recommendations
General-purpose tasks requiring frontier capability
GPT-5.1 or Claude 3.5 Sonnet
Most consistent performance across benchmarks (92% MMLU for GPT-5.1, 88.9% for Claude). Battle-tested production infrastructure and SLAs justify premium cost for critical applications.
Cost-sensitive deployments with data privacy needs
DeepSeek-V3 or Llama 4 Scout
DeepSeek-V3 matches GPT-4o performance (88.5 MMLU) with local deployment. Llama 4 costs 30x less ($0.1/1M tokens) with 86% MMLU - exceptional value for budget constraints.
Machine translation for production
Claude 3.5 Sonnet for general, DeepL for critical content
Claude earned "good" ratings from professional translators 78% of the time, the highest among LLMs. DeepL's hybrid approach requires 2-3x fewer corrections for publication-ready translation, despite narrower language coverage.
Asian language processing and multimodal tasks
Qwen 3 (72B) or Qwen3-VL (235B)
Maintains 95% terminology accuracy on Asian technical content. Qwen3-VL rivals GPT-4V on vision benchmarks, with 20x visual token compression at 97% OCR accuracy.
Edge deployment and resource-constrained environments
DistilBERT, Mistral-7B, or quantized Llama 3
DistilBERT retains 97% GLUE performance at 40% parameter reduction. Quantized Llama 3 (2-4 bit) runs on mobile devices while handling straightforward QA and dialogue effectively.
Enterprise knowledge management with proprietary data
RAG architecture with Claude/GPT-5 + semantic search
78% of production systems use RAG for good reason - it enables knowledge updates without retraining, mitigates hallucination by grounding answers in retrieved text, and keeps proprietary data private.
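The core RAG loop is simple: embed the corpus, rank passages against the query, and paste the top hits into the prompt as grounding. A minimal sketch, using toy bag-of-words vectors as a stand-in for a real embedding model (production systems would use a semantic encoder and a vector store):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" - a stand-in for a real semantic
    # embedding model (e.g. an E5- or GTE-family encoder).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank the corpus by similarity to the query and keep the top-k
    # passages; these become the grounding context in the LLM prompt.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "The refund policy allows returns within 30 days.",
    "Our office is closed on public holidays.",
    "Refunds are issued to the original payment method.",
]
context = retrieve("how do refunds work", docs)
prompt = "Answer using only this context:\n" + "\n".join(context)
```

Swapping `embed` for a real model and `docs` for a vector database gives the standard production architecture; the control flow stays the same.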
Complex reasoning requiring step-by-step verification
Test-time compute scaling with o1 or similar
Allocating extra compute at inference time lets smaller models outperform models 14x their size on complex problems, with a 4x efficiency gain from compute-optimal allocation strategies.
Multi-step autonomous task execution
Agentic frameworks with GPT-5 or Claude as orchestrator
Agentic AI handles complex workflows requiring planning, execution, and iteration. Capital One reports a 5x latency reduction, and Salesforce has closed 18k deals since its October 2024 launch.
Tasks & Benchmarks
Polish LLM General
General-purpose evaluation of language models on Polish language tasks: sentiment, reading comprehension, question answering, cyberbullying detection, and emotional intelligence.
Polish Cultural Competency
Evaluating language models on Polish linguistic and cultural knowledge across art & entertainment, culture & tradition, geography, grammar, history, and vocabulary.
Polish Text Understanding
Evaluating language models on understanding Polish text: sentiment, implicatures, phraseology, tricky questions, and hallucination resistance.
Polish Conversation Quality
Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing.
Polish Emotional Intelligence
Evaluating language models on emotional intelligence in Polish: understanding emotional states, predicting emotional responses, and nuanced sentiment analysis.
Text Summarization
Text summarization compresses documents while preserving key information — a task that became dramatically more capable with LLMs but also harder to evaluate. PEGASUS (2020) and BART set the encoder-decoder baseline, but GPT-4 and Claude produce summaries that human evaluators often prefer over reference summaries, breaking ROUGE as a meaningful metric. CNN/DailyMail and XSum remain standard benchmarks, but the field is moving toward long-document summarization (books, legal filings, earnings calls) where 100K+ token context windows are finally making single-pass summarization feasible. The core unsolved problem is faithfulness — even frontier models hallucinate facts in roughly 5-15% of summaries, making factual consistency the critical metric that separates production-ready from demo-ready.
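Since faithfulness is the metric that matters, even a crude automated check helps triage summaries before human review. A sketch of a naive proxy - the fraction of summary content words absent from the source - with made-up example strings (production systems use NLI- or QA-based consistency checkers instead):

```python
def novel_word_rate(source: str, summary: str) -> float:
    # Crude faithfulness proxy: fraction of summary content words that
    # never appear in the source. High values flag possible hallucinated
    # facts; real checkers use NLI entailment or QA-based verification.
    src = set(source.lower().split())
    words = [w for w in summary.lower().split() if len(w) > 3]
    return sum(w not in src for w in words) / max(len(words), 1)

source = "the company reported revenue of 2 billion dollars in 2024"
faithful = novel_word_rate(source, "revenue reached 2 billion dollars")
risky = novel_word_rate(source, "profits collapsed after the merger")
```

Abstractive summaries legitimately paraphrase, so this proxy over-flags; its value is as a cheap first-pass filter, not a verdict.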
Text Classification
Text classification is the gateway drug of NLP — sentiment analysis, spam detection, topic labeling — and the task where transformers first proved their dominance over LSTMs. BERT (2018) set the template, but the real revolution came when instruction-tuned LLMs like GPT-4 and Llama 3 started matching fine-tuned classifiers zero-shot, threatening to make task-specific training obsolete. SST-2, AG News, and IMDB remain standard benchmarks, though the field increasingly cares about multilingual and low-resource performance where English-centric models still stumble. The open question: does a 70B parameter model doing classification via prompting actually beat a 100M fine-tuned encoder when you factor in latency and cost?
Question Answering
Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (2016) to the adversarial complexity of Natural Questions and TriviaQA. Human parity on SQuAD 2.0 was claimed by ALBERT in 2020, effectively saturating the benchmark — but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.
Natural Language Inference
Determining whether a premise sentence entails, contradicts, or is neutral toward a hypothesis - the three-way task defined by SNLI and MNLI. Beyond benchmarking, NLI models are widely reused for zero-shot classification and for factual-consistency checking of generated text.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Named Entity Recognition
Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from unstructured text, making it foundational to knowledge graphs, financial compliance, and clinical NLP. CoNLL-2003 English F1 scores have been above 93% since BERT, and current leaders like UniNER and GLiNER push past 95%, but these numbers mask the real difficulty: nested entities, emerging entity types, and cross-lingual transfer where performance drops 10-20 points. The shift from sequence labeling to generative NER (framing extraction as text generation) has opened the door for LLMs to compete, though latency-sensitive production systems still rely on encoder models like DeBERTa-v3 and SpanBERT.
Feature Extraction
Feature extraction — generating dense vector embeddings from text — is the unsung infrastructure layer powering semantic search, RAG pipelines, clustering, and recommendation systems. Sentence-BERT (2019) made it practical, but the field exploded in 2023-2024 with instruction-tuned embedding models like E5-Mistral, GTE-Qwen2, and Nomic Embed that turned decoder-only LLMs into embedding engines, pushing MTEB scores past 70 average across 50+ tasks. The key insight was that pre-training scale transfers to embedding quality — a 7B parameter embedding model crushes a 110M one on zero-shot retrieval. Matryoshka representation learning (Kusupati et al., 2022) added the ability to truncate embeddings to any dimension without retraining, making deployment flexible across latency and storage budgets.
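Matryoshka-trained embeddings can be truncated at serving time: keep the first `dim` coordinates and L2-renormalize so cosine similarity still behaves. A minimal sketch with a made-up 8-dimensional vector (this only works well if the model was trained with a Matryoshka-style objective):

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    # Matryoshka-style deployment trick: keep the leading `dim`
    # coordinates, then L2-renormalize so dot products remain cosines.
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

full = [0.5, -0.3, 0.8, 0.1, -0.2, 0.4, 0.0, 0.6]
short = truncate_embedding(full, 4)
```

This lets one index serve multiple latency/storage budgets - store full vectors, query with truncated ones where speed matters.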
Machine Translation
Machine translation is the oldest AI grand challenge, from rule-based systems in the 1950s to the transformer revolution sparked by "Attention Is All You Need" (2017) — literally the architecture that now powers all of AI. Google's multilingual T5 and Meta's NLLB-200 pushed translation to 200+ languages, but the real disruption came from GPT-4 and Claude matching or beating specialized MT systems on WMT benchmarks for high-resource pairs like English-German and English-Chinese. The unsolved frontier is low-resource languages (under 1M parallel sentences), where dedicated models like NLLB still dominate, and literary translation where preserving style, humor, and cultural nuance remains beyond any system. BLEU scores are increasingly seen as unreliable — human evaluation and newer metrics like COMET and BLEURT are becoming the standard.
Semantic Textual Similarity
Semantic similarity measures how close two pieces of text are in meaning — the foundation of duplicate detection, paraphrase mining, and retrieval. STS Benchmark scores climbed from 70 (GloVe averages) to 86+ with Sentence-BERT, and now exceed 92 with models like GTE-Qwen2 and E5-Mistral that leverage billion-parameter backbones. The real shift was from symmetric similarity (are these two sentences paraphrases?) to asymmetric retrieval (does this passage answer this query?), driven by the RAG revolution that made embedding quality a production-critical metric. Cross-lingual semantic similarity remains a hard frontier — models trained primarily on English still lose 5-10 points when comparing sentences across language families, despite multilingual pretraining.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
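The tool-augmented pattern works by having the model emit code that the host executes over the table, rather than asking it to read cells directly. A sketch with a made-up table and a hard-coded "generated" snippet standing in for the model's output (real systems sandbox the execution):

```python
# A tiny in-memory table; column names are invented for illustration.
table = [
    {"quarter": "Q1", "revenue": 1200},
    {"quarter": "Q2", "revenue": 1350},
    {"quarter": "Q3", "revenue": 1500},
    {"quarter": "Q4", "revenue": 1425},
]

# In a tool-augmented setup the LLM would generate a string like this
# in response to "what was Q3 revenue?"; the host runs it instead of
# trusting the model to read hundreds of rows in-context.
generated_code = (
    "answer = next(r['revenue'] for r in table if r['quarter'] == 'Q3')"
)

scope = {"table": table}
exec(generated_code, scope)  # sandboxing omitted in this sketch
answer = scope["answer"]
```

The code path sidesteps the failure mode of pure neural approaches - arithmetic and lookup errors that grow with table size.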
Fill-Mask
Fill-mask (masked language modeling) is the original BERT pretraining objective: mask 15% of tokens, predict what goes there. It powered the encoder revolution that dominated NLP from 2018 to 2022 and remains the training signal behind models like RoBERTa, DeBERTa, and XLM-RoBERTa that still run most production classification and NER systems. As a standalone task it has limited direct applications, but probing what a model predicts for masked slots became a key technique for analyzing bias, factual knowledge, and linguistic competence stored in model weights. The task has faded from the research spotlight as decoder-only (GPT-style) pretraining proved more scalable, but encoder models trained with MLM remain the most cost-efficient option for tasks that need fast inference on structured prediction.
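The masking procedure itself is a few lines: pick ~15% of positions, record the originals as targets, replace them with a mask token. A simplified sketch (the original BERT recipe's 80/10/10 split between `[MASK]`, random token, and unchanged token is omitted for brevity):

```python
import random

def mask_tokens(tokens: list[str], rate: float = 0.15, seed: int = 0):
    # BERT-style MLM data prep: hide ~`rate` of tokens and keep the
    # originals as the prediction targets for those positions.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            targets[i] = tok
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

tokens = "the capital of france is paris".split()
masked, targets = mask_tokens(tokens)
```

The model is then trained to predict `targets[i]` at each masked position; probing those predictions is exactly the bias-and-knowledge analysis technique the paragraph above describes.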
Zero-Shot Classification
Zero-shot classification asks a model to categorize text into labels it has never been explicitly trained on — the ultimate test of language understanding and generalization. The breakthrough was the natural language inference (NLI) trick: reframe classification as "does this text entail the label?" using models fine-tuned on MNLI, pioneered by Yin et al. (2019) and popularized by BART-large-MNLI. Today, instruction-tuned LLMs have largely subsumed this approach — GPT-4, Claude, and Llama 3 can classify into arbitrary taxonomies via prompting with near-supervised accuracy. The remaining challenge is consistency and calibration: LLMs are powerful but their predictions can be brittle to prompt phrasing, making them unreliable for high-stakes automated pipelines without careful engineering.
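The NLI trick has a simple structure: each candidate label becomes a hypothesis ("This text is about X.") and the best-entailed one wins. A sketch of that structure, with a toy word-overlap scorer standing in for the MNLI cross-encoder a real pipeline would call:

```python
import string

def _tokens(s: str) -> list[str]:
    table = str.maketrans("", "", string.punctuation)
    return s.lower().translate(table).split()

def entailment_score(premise: str, hypothesis: str) -> float:
    # Toy stand-in for an NLI model's entailment probability, using
    # word overlap. A real pipeline would score each (text, hypothesis)
    # pair with an MNLI-fine-tuned model such as BART-large-MNLI.
    p = set(_tokens(premise))
    h = _tokens(hypothesis)
    return sum(w in p for w in h) / max(len(h), 1)

def zero_shot_classify(text: str, labels: list[str]) -> str:
    # The NLI reframing: classification becomes an entailment query
    # per label, so no label-specific training data is needed.
    return max(
        labels,
        key=lambda lab: entailment_score(text, f"This text is about {lab}."),
    )

label = zero_shot_classify(
    "The sports desk reported that the striker scored twice.",
    ["sports", "politics", "cooking"],
)
```

Note how the label set is just an argument - that is the whole appeal, and also the source of the calibration brittleness the paragraph warns about.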
Reading Comprehension
Understanding a given passage well enough to answer questions about it, as tested by benchmarks like RACE and SQuAD - closely related to question answering, but focused on comprehension of the provided text rather than open-domain retrieval.
Language Modeling
Language modeling — predicting the next token — is the pretraining objective that accidentally became the foundation of modern AI. From GPT-2's "too dangerous to release" moment in 2019 to GPT-4, Claude, Llama 3, and Gemini, scaling language models has produced emergent capabilities no one predicted from loss curves alone. Perplexity on benchmarks like WikiText-103 and Penn Treebank is essentially a historical artifact now; the field evaluates via downstream tasks (MMLU, HumanEval, MATH) because raw perplexity stopped correlating with usefulness years ago. The frontier has moved to mixture-of-experts architectures (Mixtral, DeepSeek-V3), longer context windows (1M+ tokens), and efficient inference — the model is no longer the bottleneck, serving it is.
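Perplexity, the classic LM metric mentioned above, is just the exponentiated average negative log-likelihood of the observed tokens. A minimal sketch with hypothetical per-token log-probabilities (a real evaluation would pull these from the model's forward pass):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    # PPL = exp(-mean(log p)): the geometric-mean inverse probability
    # the model assigns to the sequence, so lower is better.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probabilities for a four-token sequence.
lps = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = perplexity(lps)  # geometric mean prob = (1/64)^(1/4), so PPL = sqrt(8)
```

A PPL of ~2.83 means the model was, on average, as uncertain as a uniform choice over ~2.83 tokens - which is why raw perplexity says little about downstream usefulness.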
Honest Takes
Benchmark scores are mostly contamination
Models perform substantially better on problems released before their training cutoff than after. GLUE and SuperGLUE are saturated. Focus on domain-specific evaluation on your actual production data, not leaderboard rankings.
Longer context doesn't mean better reasoning
Models now support 2M token windows but reasoning degrades even when retrieval succeeds. Don't assume more context improves performance - test carefully as adding documents can hurt accuracy.
RAG beats fine-tuning for most use cases
Unless you have domain-specific style requirements or truly unique knowledge patterns, RAG provides better ROI: knowledge updates without retraining, lower cost, and retrieval quality that matters more than model size.
Edge deployment is viable now
Quantization delivers up to 90% size reduction while maintaining 95%+ accuracy, putting capable models on mobile devices. Edge deployment grew 340% in 2025 - for many applications, privacy and latency benefits outweigh cloud convenience.
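The size reduction comes from storing each weight in fewer bits. A sketch of symmetric int8 quantization on a made-up weight list - one scale maps floats onto [-127, 127], a 4x cut versus float32 (the 2-4 bit schemes that reach 90% use finer-grained, per-group scales):

```python
def quantize_int8(weights: list[float]):
    # Symmetric 8-bit quantization: one scale maps the float range
    # onto [-127, 127]; each weight then fits in a single byte.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 0.07, -0.21]   # hypothetical weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)          # each entry within scale/2 of original
```

The per-weight error is bounded by half the scale, which is why accuracy holds up: errors are small, dense, and partially cancel across a layer.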
Specialized models still win on translation
While LLMs won 9 of 11 WMT24 language pairs, hybrid approaches like DeepL require 2-3x fewer editorial corrections. For production translation at scale, domain-specific models justify the added complexity.