Text Summarization
Text summarization compresses documents while preserving their key information, a task that LLMs handle dramatically better than earlier systems but that has also become harder to evaluate. PEGASUS (2020) and BART set the encoder-decoder baseline, but GPT-4 and Claude now produce summaries that human evaluators often prefer over the reference summaries themselves, which breaks ROUGE as a meaningful metric. CNN/DailyMail and XSum remain the standard benchmarks, but the field is moving toward long-document summarization (books, legal filings, earnings calls), where 100K+ token context windows finally make single-pass summarization feasible. The core unsolved problem is faithfulness: even frontier models hallucinate facts in roughly 5-15% of summaries, making factual consistency the metric that separates production-ready systems from demos.
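To see why ROUGE breaks down when model summaries surpass the references: ROUGE only measures n-gram overlap with a reference, so a summary that is better written than the reference can still score poorly. A minimal sketch of ROUGE-1 F1 (the `rouge1_f1` helper below is illustrative, not taken from any official scoring code, and omits the stemming and tokenization that standard implementations apply):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap between a candidate summary and a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared word at most min(cand, ref) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# An exact copy of the reference scores 1.0; a factually faithful paraphrase
# that uses different words scores much lower, which is the core complaint.
print(rouge1_f1("the cat sat on the mat", "the cat sat on the mat"))
print(rouge1_f1("a feline rested atop the rug", "the cat sat on the mat"))
```

Because the metric is purely lexical, it also cannot detect hallucinated facts that happen to reuse reference vocabulary, which is why faithfulness needs separate evaluation.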
CNN/DailyMail
300K news articles with multi-sentence summaries. Standard benchmark for abstractive summarization.
Top 10
Leading models on CNN/DailyMail.
No results yet. Be the first to contribute.
All datasets
1 dataset tracked for this task.
Related tasks
Other tasks in Natural Language Processing.
Looking to run a model? HuggingFace hosts inference for this task type.