Codesota - NLP - Text SummarizationCNN/DailyMail - XSum - factuality evalTask page

00 - Text Summarization

Text summarization task router

Summarization compresses text, but the real requirement is usually fidelity. Pick extractive, abstractive, or long-context summarization based on whether missing details, invented facts, or style drift are the biggest risk.

Benchmark

CNN/DailyMail - XSum - factuality eval

Current pick

Claude 4 / GPT-5

01 - Explainer

What this task measures.

Summarization turns long input into a shorter artifact, but the output contract changes by use case. News summarization rewards compression and fluency; meeting and legal summaries need coverage; enterprise summaries need source-grounded facts, citations, and explicit handling of uncertainty.

02 - Benchmarks

Use a benchmark ladder.

One leaderboard rarely captures the task. Use the canonical benchmark for lineage, then add harder or more domain-specific checks before choosing a model.

Benchmark	Role	Metric	Caveat
CNN/DailyMail	Classic news summarization	ROUGE	Useful for lineage; weak proxy for long-context, factual, or domain-specific summaries.
XSum	Abstractive stress test	ROUGE / human eval	Encourages concise rewriting and can reward unsupported abstraction.
SummEval / QAGS	Quality and factuality	Coherence / consistency / answerability	Better quality signal, but still smaller than real enterprise document sets.
Local source-grounded eval	Production gate	Claim support / coverage / omission rate	Needed when missed obligations or invented facts are expensive.

03 - Evaluation

What to compare.

The public benchmark is a shortlist signal. Production choice still depends on latency, cost, domain drift, and how expensive mistakes are.

Axis	Value	Why it matters
Classic benchmark	CNN/DailyMail	Good for news-style compression, weak for modern enterprise documents.
Abstractive stress test	XSum	Tests concise rewriting but can reward unsupported abstraction.
Production metric	Factual consistency + coverage	ROUGE is not enough; check missing obligations and hallucinated claims.
Failure mode	Confident omission	The summary sounds good while dropping the one fact the user needed.

04 - Routing

Pick by task shape.

Must not hallucinate

Extractive summary

Select source sentences and preserve auditability.

Readable executive brief

LLM abstractive summary

Better structure and tone, but needs factual checks.

Very long documents

Map-reduce or long-context LLM

Chunking and coverage tracking prevent important sections from disappearing.

Legal or medical memo

Summary + citation verifier

Every key claim should map back to source spans.

05 - Related

Need implementation details?

Open the lower-level explainer for architecture, code examples, and implementation options.

Open summarization explainer ->