
Text Summarization

Condense long documents into concise summaries. Essential for news aggregation, research, and document processing.

How Text Summarization Works

From extractive highlighting to abstractive generation: how models learn to condense documents while preserving meaning.

1. The Fundamental Question: Copy or Generate?

Every summarization system must answer this: Should I select existing sentences, or write new ones? This choice shapes everything downstream.

Extractive: Copy the important sentences

Identify and extract the most important sentences from the source document. The summary is a subset of the original text.

ANALOGY:
"Like highlighting key passages in a textbook"
Pros
  • + Always grammatical
  • + Faithful to source
  • + Fast
  • + No hallucinations
Cons
  • - Can feel choppy
  • - Limited compression
  • - May miss nuance
  • - Cannot paraphrase
Methods: TextRank, LexRank, BERT-Ext, MatchSum
Abstractive: Write new sentences that capture the meaning

Generate new text that captures the key information from the source. The model can paraphrase, combine ideas, and use words not in the original.

ANALOGY:
"Like explaining a book to a friend in your own words"
Pros
  • + Natural flow
  • + High compression
  • + Can synthesize
  • + More flexible
Cons
  • - May hallucinate
  • - Can miss facts
  • - Slower
  • - Needs more data
Methods: BART, T5, Pegasus, LLMs

See the Difference

SOURCE DOCUMENT:
The Amazon rainforest produces 20% of the world's oxygen. It spans 9 countries and covers 5.5 million km². Deforestation threatens this vital ecosystem.
EXTRACTIVE SUMMARY:
The Amazon rainforest produces 20% of the world's oxygen. It spans 9 countries and covers 5.5 million km². Deforestation threatens this vital ecosystem.
Highlighted sentences copied verbatim
ABSTRACTIVE SUMMARY:
The Amazon rainforest, spanning 9 countries and producing 20% of Earth's oxygen, faces threats from deforestation.
New sentence combining multiple facts
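The extractive side of this comparison is easy to sketch in code. Below is a toy frequency-based sentence ranker in Python: a drastic simplification of what TextRank-style methods do, with scoring logic and helper names that are ours, not from any library.

```python
import re
from collections import Counter

def extractive_summary(text, k=2):
    """Rank sentences by the average document-frequency of their words,
    then return the top-k sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    def score(sent):
        toks = re.findall(r"[a-z']+", sent.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)
    top = set(sorted(sentences, key=score, reverse=True)[:k])
    # Emit the selected sentences in source order, not score order
    return " ".join(s for s in sentences if s in top)

doc = ("The Amazon rainforest produces 20% of the world's oxygen. "
       "It spans 9 countries and covers 5.5 million km2. "
       "Deforestation threatens this vital ecosystem.")
print(extractive_summary(doc, k=2))
```

Every output sentence is a verbatim subset of the input, which is exactly why extractive methods cannot hallucinate but also cannot paraphrase.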
2. How Abstractive Summarization Works

The encoder-decoder architecture: read the full document, build understanding, then generate a compressed version.

The Encoder-Decoder Pipeline

1. Tokenize Input
Break the source document into tokens
"The Amazon rainforest..." -> [The, Amazon, rain, forest, ...]
2. Encode
Transform tokens into contextual representations. Each token becomes a vector that knows about its neighbors.
[tokens] -> Encoder -> [h1, h2, h3, ...]
3. Cross-Attention
The decoder attends to relevant parts of the encoded input when generating each output token.
Decoder queries: 'What should I say about oxygen?' -> Attends to 'produces 20% oxygen'
4. Decode (Autoregressive)
Generate summary tokens one at a time, each conditioned on previous outputs and the encoded input.
[START] -> 'The' -> 'Amazon' -> '...' -> [END]
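Step 4's loop can be made concrete with a toy in Python. The lookup table here is a hypothetical stand-in for the decoder network; a real decoder conditions on all previous tokens plus the encoder states, not just the last token.

```python
def greedy_decode(step, max_len=10):
    """Toy autoregressive loop: generate one token at a time, each
    conditioned on what came before, until [END] or a length cap."""
    tokens = ["[START]"]
    while tokens[-1] != "[END]" and len(tokens) < max_len:
        tokens.append(step(tokens))
    return tokens

# Stand-in for the decoder network: a lookup on the last token only.
table = {"[START]": "The", "The": "Amazon", "Amazon": "[END]"}
step = lambda toks: table.get(toks[-1], "[END]")
print(greedy_decode(step))  # ['[START]', 'The', 'Amazon', '[END]']
```

The loop shape is the same in real systems; what changes is that `step` becomes a forward pass through the decoder with cross-attention over the encoded document.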

Cross-Attention: The Key to Summarization

When generating each word of the summary, the decoder "looks back" at the encoded document. It learns which parts of the source are relevant for the current output position.

Encoder:
The
Amazon
produces
20%
oxygen
...
deforestation
Decoder:
The
Amazon
generates
20%
of
Earth's
O2

When generating "generates", the model attends strongly to "produces", "20%", and "oxygen" from the source. Note how it paraphrases ("generates" instead of "produces") while preserving meaning.
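In code, cross-attention is just scaled dot-product attention where the queries come from the decoder and the keys/values from the encoder. A minimal NumPy sketch, with random vectors standing in for real hidden states:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: decoder queries attend over
    encoder keys/values. queries: (T_dec, d); keys/values: (T_enc, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (T_dec, T_enc)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ values, weights

rng = np.random.default_rng(0)
enc_states = rng.normal(size=(6, 8))  # 6 encoded source tokens
dec_query = rng.normal(size=(1, 8))   # one decoder position's query
context, attn = cross_attention(dec_query, enc_states, enc_states)
print(attn.round(2))  # one weight per source token; each row sums to 1
```

The `attn` row is the "looking back" described above: a distribution over source tokens that says how much each one contributes to the current output position.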

Why Encoder-Decoder?
  • - Encoder builds full document understanding before any generation
  • - Decoder can attend to any part of the document at any time
  • - Natural fit for compression: many inputs to few outputs
Why Not Decoder-Only (like GPT)?
  • - Works fine with instruction prompting ("summarize:")
  • - But: document must fit in context with summary space
  • - No separate encoding step means less efficient attention
3. Key Models

From specialized summarizers to general-purpose LLMs.

BART (2019), Facebook AI
Bidirectional and Auto-Regressive Transformers
Encoder-Decoder, 1024 tokens
TRAINING
Denoising autoencoder (mask, delete, shuffle)
STRENGTHS
Great general summarizer, good for news
NOTABLE
Pre-trained on corruption tasks, then fine-tuned
Model: facebook/bart-large-cnn

Which Model Should You Use?

Fast + Good Enough
BART-CNN
News articles, short docs. Self-hosted.
Long Documents
LED / LongT5
Papers, reports. Up to 16K tokens.
Best Quality
Claude / GPT-4
Any length. Controllable style. API cost.
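The decision table above can be folded into a small routing helper. This is a hypothetical heuristic mirroring the three rows; the thresholds are approximate context limits, not hard rules.

```python
def pick_summarizer(n_tokens, best_quality=False):
    """Route a document to a summarizer based on its token count,
    following the fast / long / best-quality split above."""
    if best_quality:
        return "claude / gpt-4"
    if n_tokens <= 1024:
        return "facebook/bart-large-cnn"
    if n_tokens <= 16_000:
        return "LED / LongT5"
    return "claude / gpt-4"

print(pick_summarizer(800))     # facebook/bart-large-cnn
print(pick_summarizer(12_000))  # LED / LongT5
print(pick_summarizer(50_000))  # claude / gpt-4
```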
4. Long Document Strategies

When your document exceeds the model's context window, you have options.

1. Chunking

Split document into chunks that fit the model's context window. Summarize each chunk, then optionally summarize the summaries.

Doc -> [Chunk1, Chunk2, Chunk3] -> [Sum1, Sum2, Sum3] -> Final Summary
Pros
  • + Works with any model
  • + Simple to implement
  • + Parallelizable
Cons
  • - Loses cross-chunk context
  • - Quality depends on chunk boundaries
  • - Multi-step latency
Best For
Very long documents with clear sections
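A chunking pipeline is only a few lines once you have any summarizer callable. The sketch below uses overlapping windows and a stub summarizer for illustration; in practice `summarize` would wrap a BART pipeline or an LLM call.

```python
def chunk(tokens, size, overlap=50):
    """Split a token list into overlapping windows so content cut at a
    chunk boundary still appears whole in one chunk."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def map_reduce_summarize(tokens, summarize, size=1000):
    """Summarize each chunk, then summarize the concatenated chunk
    summaries. `summarize` is any callable over a token list."""
    partial = [summarize(c) for c in chunk(tokens, size)]
    merged = [tok for p in partial for tok in p]
    return summarize(merged) if len(partial) > 1 else partial[0]

# Stub summarizer for illustration: keep each chunk's first 10 tokens.
stub = lambda toks: toks[:10]
doc = [f"tok{i}" for i in range(2500)]
print(map_reduce_summarize(doc, stub)[:3])  # ['tok0', 'tok1', 'tok2']
```

The overlap parameter is the main lever against the "chunk boundaries" con: it trades a little redundancy for not splitting a key sentence across two chunks.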
2. Hierarchical

Build a tree structure: summarize paragraphs, then sections, then the whole document.

Paragraphs -> Section summaries -> Document summary
Pros
  • + Preserves structure
  • + Handles very long docs
  • + Natural for reports
Cons
  • - Complex pipeline
  • - Error propagation
  • - Needs document structure
Best For
Structured documents like papers, reports
3. Sliding Window (LED/Longformer)

Use sparse attention patterns: local attention for nearby tokens, global attention for special tokens.

Each token attends to: [local window] + [global tokens]
Pros
  • + Single pass
  • + Maintains global context
  • + O(n), not O(n²)
Cons
  • - Still limited context
  • - Special model required
  • - May miss distant relations
Best For
Documents up to 16K tokens
4. LLM Long Context

Modern LLMs (Claude 200K, GPT-4 128K) can process entire documents in one pass.

Full document -> LLM -> Summary
Pros
  • + Sees everything at once
  • + No information loss
  • + Simple
Cons
  • - Expensive
  • - Attention dilution possible
  • - Context limits still exist
Best For
Documents up to 100K+ tokens, when quality matters most
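For the single-pass LLM route, most of the work is prompt construction. The prompt wording below is ours, and the commented-out API call assumes the `anthropic` Python SDK, a valid API key, and a current model name (model names change over time).

```python
def build_summary_prompt(document, style="3-5 sentence abstract"):
    """Assemble a single-pass summarization prompt for a long-context
    LLM. Illustrative wording, not a canonical template."""
    return (
        f"Summarize the following document as a {style}. "
        "Only state facts that appear in the text.\n\n"
        f"<document>\n{document}\n</document>"
    )

prompt = build_summary_prompt("The Amazon rainforest ...")

# Sending it (sketch; requires an API key):
# from anthropic import Anthropic
# client = Anthropic()
# msg = client.messages.create(
#     model="claude-sonnet-4-20250514",
#     max_tokens=500,
#     messages=[{"role": "user", "content": prompt}],
# )
# print(msg.content[0].text)
```

The instruction to stick to facts in the text is a cheap hedge against the hallucination risk noted in the cons above; it reduces but does not eliminate it.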

Context Length in Perspective

BART (1K)
~2 pages
LED (16K)
~30 pages
GPT-4 (128K)
~250 pages
Claude (200K)
~400 pages

Approximate page counts assuming ~500 tokens per page; actual counts vary with formatting.

5. Benchmarks & Evaluation

Standard datasets and the ROUGE metrics used to measure summarization quality.

Understanding ROUGE Scores

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between the generated summary and human-written reference summaries.

ROUGE-1
Unigram overlap
Pred: 'The cat sat' vs Ref: 'The dog sat' -> 2/3 words match
ROUGE-2
Bigram overlap
'The cat' and 'cat sat' vs 'The dog' and 'dog sat' -> 0/2 match
ROUGE-L
Longest common subsequence
'The sat' appears in both -> LCS length = 2
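The variants above are easy to compute by hand. Here is a minimal ROUGE-N recall implementation that reproduces the unigram and bigram examples; real evaluations use a package such as `rouge-score`, which also reports precision and F1.

```python
from collections import Counter

def rouge_n(pred, ref, n=1):
    """ROUGE-N recall: overlapping n-grams / total reference n-grams."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    p, r = ngrams(pred), ngrams(ref)
    overlap = sum((p & r).values())  # clipped overlap: min count per n-gram
    return overlap / max(sum(r.values()), 1)

print(rouge_n("The cat sat", "The dog sat", n=1))  # 2/3: 'the', 'sat' match
print(rouge_n("The cat sat", "The dog sat", n=2))  # 0.0: no bigram matches
```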
ROUGE Limitations
ROUGE measures word overlap, not semantic quality. A summary can have high ROUGE by copying phrases but miss the main point. For production, combine with human evaluation or neural metrics like BERTScore.
| Dataset | Type | Size | Avg Length | Summary Style | SOTA (R1/R2/RL) |
|---|---|---|---|---|---|
| CNN/DailyMail | News | 300K articles | 800 words | Multi-sentence highlights | PEGASUS: 44.2 / 21.5 / 41.1 |
| XSum | Extreme Summary | 227K articles | 400 words | Single sentence | PEGASUS: 47.2 / 24.6 / 39.3 |
| arXiv | Scientific Papers | 215K papers | 6K words | Abstract | LED: 46.6 / 19.6 / 42.0 |
| PubMed | Medical | 133K abstracts | 3K words | Abstract | LED: 45.5 / 19.1 / 41.0 |
| MultiNews | Multi-Document | 56K clusters | 2K words (10 docs) | Comprehensive summary | PRIMERA: 49.9 / 21.1 / 25.9 |
6. Code Examples

From quick BART inference to hierarchical summarization pipelines.

BART (Quick Start)
pip install transformers
Recommended Start
from transformers import pipeline

# Load BART fine-tuned on CNN/DailyMail
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = """
The Amazon rainforest, often referred to as the "lungs of the Earth,"
produces approximately 20% of the world's oxygen. Spanning nine countries
and covering 5.5 million square kilometers, it is the largest tropical
rainforest on the planet. However, deforestation rates have accelerated
dramatically in recent years, threatening not only biodiversity but also
global climate stability. Scientists warn that losing the Amazon could
trigger irreversible climate tipping points.
"""

summary = summarizer(
    article,
    max_length=60,
    min_length=20,
    do_sample=False
)

print(summary[0]['summary_text'])
# Output: The Amazon rainforest produces 20% of the world's oxygen.
# Deforestation threatens biodiversity and global climate stability.

End-to-End Example

See how different approaches handle the same news article.

SOURCE ARTICLE (127 words):
The European Space Agency's Rosetta spacecraft has made history by successfully landing a probe on a comet for the first time. The Philae lander touched down on Comet 67P/Churyumov-Gerasimenko after a seven-hour descent from its mother ship. Scientists hope the mission will help unlock secrets about the origins of the solar system and possibly life on Earth. The comet is currently 510 million kilometers from Earth, traveling at 18 kilometers per second. The landing was extremely challenging due to the comet's weak gravity and irregular shape. Philae is equipped with instruments to analyze the comet's composition, including drills to sample the surface material. The mission has been 10 years in the making and cost approximately 1.4 billion euros.
Extractive
35 words
The European Space Agency's Rosetta spacecraft has made history by successfully landing a probe on a comet for the first time. Scientists hope the mission will help unlock secrets about the origins of the solar system and possibly life on Earth.
Copied key sentences verbatim
Abstractive (BART)
28 words
ESA's Rosetta mission achieved the first-ever comet landing, placing the Philae probe on Comet 67P after a decade-long, 1.4 billion euro mission aimed at understanding the solar system's origins.
Compressed and rephrased
LLM (Claude)
52 words
The European Space Agency landed a probe on a comet for the first time in history. The Philae lander, part of the decade-long Rosetta mission, touched down on Comet 67P to study the origins of the solar system. The 1.4 billion euro mission overcame significant challenges including the comet's weak gravity and extreme distance from Earth.
More comprehensive, natural

Quick Reference

For Production
  • - BART-CNN for news/short docs
  • - LED for papers/reports (16K)
  • - Claude/GPT-4 for best quality
Long Documents
  • - Under 16K: use LED directly
  • - Under 200K: use Claude directly
  • - Longer: chunk + hierarchical
Evaluation
  • - ROUGE for quick comparison
  • - BERTScore for semantic
  • - Human eval for production

Use Cases

  • News summarization
  • Research paper digests
  • Meeting notes
  • Legal document summaries
  • Email tl;dr

Architectural Patterns

Extractive Summarization

Select important sentences from the source.

Pros:
  • + Faithful to source
  • + Fast
  • + No hallucination
Cons:
  • - Less fluent
  • - Can't paraphrase
  • - Fixed to source text

Abstractive Summarization

Generate new text that captures the meaning.

Pros:
  • + Fluent output
  • + Can condense more
  • + Natural reading
Cons:
  • - May hallucinate
  • - Slower
  • - Needs more compute

LLM Summarization

Use large language models with summarization prompts.

Pros:
  • + Handles any format
  • + Controllable style
  • + Long context
Cons:
  • - Expensive
  • - May miss details
  • - Inconsistent

Implementations

API Services

Claude

Anthropic
API

200K context. Excellent for long documents.

GPT-4o

OpenAI
API

128K context. Great instruction following.

Open Source

BART-large-cnn

Apache 2.0
Open Source

Classic news summarization. Fast and reliable.

Pegasus

Apache 2.0
Open Source

Optimized for summarization. Good abstractive quality.

LongT5

Apache 2.0
Open Source

Long document summarization. 16K+ tokens.

Quick Facts

Input: Text
Output: Text
Implementations: 3 open source, 2 API
Patterns: 3 approaches
