
Text Summarization

Condense long documents into concise summaries. Essential for news aggregation, research, and document processing.

How Text Summarization Works

From extractive highlighting to abstractive generation: how models learn to condense documents while preserving meaning.

1. The Fundamental Question: Copy or Generate?

Every summarization system must answer this: Should I select existing sentences, or write new ones? This choice shapes everything downstream.

Extractive: Copy the important sentences

Identify and extract the most important sentences from the source document. The summary is a subset of the original text.

ANALOGY:
"Like highlighting key passages in a textbook"
Pros
  • + Always grammatical
  • + Faithful to source
  • + Fast
  • + No hallucinations
Cons
  • - Can feel choppy
  • - Limited compression
  • - May miss nuance
  • - Cannot paraphrase
Methods: TextRank, LexRank, BERT-Ext, MatchSum
Abstractive: Write new sentences that capture the meaning

Generate new text that captures the key information from the source. The model can paraphrase, combine ideas, and use words not in the original.

ANALOGY:
"Like explaining a book to a friend in your own words"
Pros
  • + Natural flow
  • + High compression
  • + Can synthesize
  • + More flexible
Cons
  • - May hallucinate
  • - Can miss facts
  • - Slower
  • - Needs more data
Methods: BART, T5, Pegasus, LLMs

See the Difference

SOURCE DOCUMENT:
The Amazon rainforest produces 20% of the world's oxygen. It spans 9 countries and covers 5.5 million km². Deforestation threatens this vital ecosystem.
EXTRACTIVE SUMMARY:
The Amazon rainforest produces 20% of the world's oxygen. It spans 9 countries and covers 5.5 million km². Deforestation threatens this vital ecosystem.
Highlighted sentences copied verbatim
ABSTRACTIVE SUMMARY:
The Amazon rainforest, spanning 9 countries and producing 20% of Earth's oxygen, faces threats from deforestation.
New sentence combining multiple facts
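The extractive side of this comparison is easy to sketch in code. Below is a toy frequency-based sentence ranker in Python: a drastic simplification of what TextRank-style methods do, with scoring logic and helper names that are ours, not from any library.

```python
import re
from collections import Counter

def extractive_summary(text, k=2):
    """Rank sentences by the average document-frequency of their words,
    then return the top-k sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    def score(sent):
        toks = re.findall(r"[a-z']+", sent.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)
    top = set(sorted(sentences, key=score, reverse=True)[:k])
    # Emit the selected sentences in source order, not score order
    return " ".join(s for s in sentences if s in top)

doc = ("The Amazon rainforest produces 20% of the world's oxygen. "
       "It spans 9 countries and covers 5.5 million km2. "
       "Deforestation threatens this vital ecosystem.")
print(extractive_summary(doc, k=2))
```

Every output sentence is a verbatim subset of the input, which is exactly why extractive methods cannot hallucinate but also cannot paraphrase.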
2. How Abstractive Summarization Works

The encoder-decoder architecture: read the full document, build understanding, then generate a compressed version.

The Encoder-Decoder Pipeline

1. Tokenize Input
Break the source document into tokens
"The Amazon rainforest..." -> [The, Amazon, rain, forest, ...]
2. Encode
Transform tokens into contextual representations. Each token becomes a vector that knows about its neighbors.
[tokens] -> Encoder -> [h1, h2, h3, ...]
3. Cross-Attention
The decoder attends to relevant parts of the encoded input when generating each output token.
Decoder queries: 'What should I say about oxygen?' -> Attends to 'produces 20% oxygen'
4. Decode (Autoregressive)
Generate summary tokens one at a time, each conditioned on previous outputs and the encoded input.
[START] -> 'The' -> 'Amazon' -> '...' -> [END]
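Step 4's loop can be made concrete with a toy in Python. The lookup table here is a hypothetical stand-in for the decoder network; a real decoder conditions on all previous tokens plus the encoder states, not just the last token.

```python
def greedy_decode(step, max_len=10):
    """Toy autoregressive loop: generate one token at a time, each
    conditioned on what came before, until [END] or a length cap."""
    tokens = ["[START]"]
    while tokens[-1] != "[END]" and len(tokens) < max_len:
        tokens.append(step(tokens))
    return tokens

# Stand-in for the decoder network: a lookup on the last token only.
table = {"[START]": "The", "The": "Amazon", "Amazon": "[END]"}
step = lambda toks: table.get(toks[-1], "[END]")
print(greedy_decode(step))  # ['[START]', 'The', 'Amazon', '[END]']
```

The loop shape is the same in real systems; what changes is that `step` becomes a forward pass through the decoder with cross-attention over the encoded document.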

Cross-Attention: The Key to Summarization

When generating each word of the summary, the decoder "looks back" at the encoded document. It learns which parts of the source are relevant for the current output position.

Encoder:
The
Amazon
produces
20%
oxygen
...
deforestation
Decoder:
The
Amazon
generates
20%
of
Earth's
O2

When generating "generates", the model attends strongly to "produces", "20%", and "oxygen" from the source. Note how it paraphrases ("generates" instead of "produces") while preserving meaning.
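In code, cross-attention is just scaled dot-product attention where the queries come from the decoder and the keys/values from the encoder. A minimal NumPy sketch, with random vectors standing in for real hidden states:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: decoder queries attend over
    encoder keys/values. queries: (T_dec, d); keys/values: (T_enc, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (T_dec, T_enc)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ values, weights

rng = np.random.default_rng(0)
enc_states = rng.normal(size=(6, 8))  # 6 encoded source tokens
dec_query = rng.normal(size=(1, 8))   # one decoder position's query
context, attn = cross_attention(dec_query, enc_states, enc_states)
print(attn.round(2))  # one weight per source token; each row sums to 1
```

The `attn` row is the "looking back" described above: a distribution over source tokens that says how much each one contributes to the current output position.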

Why Encoder-Decoder?
  • - Encoder builds full document understanding before any generation
  • - Decoder can attend to any part of the document at any time
  • - Natural fit for compression: many inputs to few outputs
Why Not Decoder-Only (like GPT)?
  • - Works fine with instruction prompting ("summarize:")
  • - But: document must fit in context with summary space
  • - No separate encoding step means less efficient attention
3. Key Models

From specialized summarizers to general-purpose LLMs.

BART (2019), Facebook AI
Bidirectional and Auto-Regressive Transformers
Encoder-Decoder, 1024 tokens
TRAINING
Denoising autoencoder (mask, delete, shuffle)
STRENGTHS
Great general summarizer, good for news
NOTABLE
Pre-trained on corruption tasks, then fine-tuned
Model: facebook/bart-large-cnn

Which Model Should You Use?

Fast + Good Enough
BART-CNN
News articles, short docs. Self-hosted.
Long Documents
LED / LongT5
Papers, reports. Up to 16K tokens.
Best Quality
Claude / GPT-4
Any length. Controllable style. API cost.
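The decision table above can be folded into a small routing helper. This is a hypothetical heuristic mirroring the three rows; the thresholds are approximate context limits, not hard rules.

```python
def pick_summarizer(n_tokens, best_quality=False):
    """Route a document to a summarizer based on its token count,
    following the fast / long / best-quality split above."""
    if best_quality:
        return "claude / gpt-4"
    if n_tokens <= 1024:
        return "facebook/bart-large-cnn"
    if n_tokens <= 16_000:
        return "LED / LongT5"
    return "claude / gpt-4"

print(pick_summarizer(800))     # facebook/bart-large-cnn
print(pick_summarizer(12_000))  # LED / LongT5
print(pick_summarizer(50_000))  # claude / gpt-4
```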
4. Long Document Strategies

When your document exceeds the model's context window, you have options.

1. Chunking

Split document into chunks that fit the model's context window. Summarize each chunk, then optionally summarize the summaries.

Doc -> [Chunk1, Chunk2, Chunk3] -> [Sum1, Sum2, Sum3] -> Final Summary
Pros
  • + Works with any model
  • + Simple to implement
  • + Parallelizable
Cons
  • - Loses cross-chunk context
  • - Quality depends on chunk boundaries
  • - Multi-step latency
Best For
Very long documents with clear sections
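A chunking pipeline is only a few lines once you have any summarizer callable. The sketch below uses overlapping windows and a stub summarizer for illustration; in practice `summarize` would wrap a BART pipeline or an LLM call.

```python
def chunk(tokens, size, overlap=50):
    """Split a token list into overlapping windows so content cut at a
    chunk boundary still appears whole in one chunk."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def map_reduce_summarize(tokens, summarize, size=1000):
    """Summarize each chunk, then summarize the concatenated chunk
    summaries. `summarize` is any callable over a token list."""
    partial = [summarize(c) for c in chunk(tokens, size)]
    merged = [tok for p in partial for tok in p]
    return summarize(merged) if len(partial) > 1 else partial[0]

# Stub summarizer for illustration: keep each chunk's first 10 tokens.
stub = lambda toks: toks[:10]
doc = [f"tok{i}" for i in range(2500)]
print(map_reduce_summarize(doc, stub)[:3])  # ['tok0', 'tok1', 'tok2']
```

The overlap parameter is the main lever against the "chunk boundaries" con: it trades a little redundancy for not splitting a key sentence across two chunks.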
2. Hierarchical

Build a tree structure: summarize paragraphs, then sections, then the whole document.

Paragraphs -> Section summaries -> Document summary
Pros
  • + Preserves structure
  • + Handles very long docs
  • + Natural for reports
Cons
  • - Complex pipeline
  • - Error propagation
  • - Needs document structure
Best For
Structured documents like papers, reports
3. Sliding Window (LED/Longformer)

Use sparse attention patterns: local attention for nearby tokens, global attention for special tokens.

Each token attends to: [local window] + [global tokens]
Pros
  • + Single pass
  • + Maintains global context
  • + O(n), not O(n²)
Cons
  • - Still limited context
  • - Special model required
  • - May miss distant relations
Best For
Documents up to 16K tokens
4. LLM Long Context

Modern LLMs (Claude 200K, GPT-4 128K) can process entire documents in one pass.

Full document -> LLM -> Summary
Pros
  • + Sees everything at once
  • + No information loss
  • + Simple
Cons
  • - Expensive
  • - Attention dilution possible
  • - Context limits still exist
Best For
Documents up to 100K+ tokens, when quality matters most
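For the single-pass LLM route, most of the work is prompt construction. The prompt wording below is ours, and the commented-out API call assumes the `anthropic` Python SDK, a valid API key, and a current model name (model names change over time).

```python
def build_summary_prompt(document, style="3-5 sentence abstract"):
    """Assemble a single-pass summarization prompt for a long-context
    LLM. Illustrative wording, not a canonical template."""
    return (
        f"Summarize the following document as a {style}. "
        "Only state facts that appear in the text.\n\n"
        f"<document>\n{document}\n</document>"
    )

prompt = build_summary_prompt("The Amazon rainforest ...")

# Sending it (sketch; requires an API key):
# from anthropic import Anthropic
# client = Anthropic()
# msg = client.messages.create(
#     model="claude-sonnet-4-20250514",
#     max_tokens=500,
#     messages=[{"role": "user", "content": prompt}],
# )
# print(msg.content[0].text)
```

The instruction to stick to facts in the text is a cheap hedge against the hallucination risk noted in the cons above; it reduces but does not eliminate it.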

Context Length in Perspective

BART (1K)
~2 pages
LED (16K)
~30 pages
GPT-4 (128K)
~250 pages
Claude (200K)
~400 pages

Approximate page counts assuming ~500 tokens per page; actual counts vary with formatting.

5. Benchmarks & Evaluation

Standard datasets and the ROUGE metrics used to measure summarization quality.

Understanding ROUGE Scores

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between the generated summary and human-written reference summaries.

ROUGE-1
Unigram overlap
Pred: 'The cat sat' vs Ref: 'The dog sat' -> 2/3 words match
ROUGE-2
Bigram overlap
'The cat' and 'cat sat' vs 'The dog' and 'dog sat' -> 0/2 match
ROUGE-L
Longest common subsequence
'The sat' appears in both -> LCS length = 2
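The variants above are easy to compute by hand. Here is a minimal ROUGE-N recall implementation that reproduces the unigram and bigram examples; real evaluations use a package such as `rouge-score`, which also reports precision and F1.

```python
from collections import Counter

def rouge_n(pred, ref, n=1):
    """ROUGE-N recall: overlapping n-grams / total reference n-grams."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    p, r = ngrams(pred), ngrams(ref)
    overlap = sum((p & r).values())  # clipped overlap: min count per n-gram
    return overlap / max(sum(r.values()), 1)

print(rouge_n("The cat sat", "The dog sat", n=1))  # 2/3: 'the', 'sat' match
print(rouge_n("The cat sat", "The dog sat", n=2))  # 0.0: no bigram matches
```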
ROUGE Limitations
ROUGE measures word overlap, not semantic quality. A summary can have high ROUGE by copying phrases but miss the main point. For production, combine with human evaluation or neural metrics like BERTScore.
| Dataset | Type | Size | Avg Length | Summary Style | SOTA (R1/R2/RL) |
|---|---|---|---|---|---|
| CNN/DailyMail | News | 300K articles | 800 words | Multi-sentence highlights | PEGASUS: 44.2 / 21.5 / 41.1 |
| XSum | Extreme Summary | 227K articles | 400 words | Single sentence | PEGASUS: 47.2 / 24.6 / 39.3 |
| arXiv | Scientific Papers | 215K papers | 6K words | Abstract | LED: 46.6 / 19.6 / 42.0 |
| PubMed | Medical | 133K abstracts | 3K words | Abstract | LED: 45.5 / 19.1 / 41.0 |
| MultiNews | Multi-Document | 56K clusters | 2K words (10 docs) | Comprehensive summary | PRIMERA: 49.9 / 21.1 / 25.9 |
6. Code Examples

From quick BART inference to hierarchical summarization pipelines.

BART (Quick Start)
pip install transformers
Recommended Start
from transformers import pipeline

# Load BART fine-tuned on CNN/DailyMail
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = """
The Amazon rainforest, often referred to as the "lungs of the Earth,"
produces approximately 20% of the world's oxygen. Spanning nine countries
and covering 5.5 million square kilometers, it is the largest tropical
rainforest on the planet. However, deforestation rates have accelerated
dramatically in recent years, threatening not only biodiversity but also
global climate stability. Scientists warn that losing the Amazon could
trigger irreversible climate tipping points.
"""

summary = summarizer(
    article,
    max_length=60,
    min_length=20,
    do_sample=False
)

print(summary[0]['summary_text'])
# Output: The Amazon rainforest produces 20% of the world's oxygen.
# Deforestation threatens biodiversity and global climate stability.

End-to-End Example

See how different approaches handle the same news article.

SOURCE ARTICLE (127 words):
The European Space Agency's Rosetta spacecraft has made history by successfully landing a probe on a comet for the first time. The Philae lander touched down on Comet 67P/Churyumov-Gerasimenko after a seven-hour descent from its mother ship. Scientists hope the mission will help unlock secrets about the origins of the solar system and possibly life on Earth. The comet is currently 510 million kilometers from Earth, traveling at 18 kilometers per second. The landing was extremely challenging due to the comet's weak gravity and irregular shape. Philae is equipped with instruments to analyze the comet's composition, including drills to sample the surface material. The mission has been 10 years in the making and cost approximately 1.4 billion euros.
Extractive
35 words
The European Space Agency's Rosetta spacecraft has made history by successfully landing a probe on a comet for the first time. Scientists hope the mission will help unlock secrets about the origins of the solar system and possibly life on Earth.
Copied key sentences verbatim
Abstractive (BART)
28 words
ESA's Rosetta mission achieved the first-ever comet landing, placing the Philae probe on Comet 67P after a decade-long, 1.4 billion euro mission aimed at understanding the solar system's origins.
Compressed and rephrased
LLM (Claude)
52 words
The European Space Agency landed a probe on a comet for the first time in history. The Philae lander, part of the decade-long Rosetta mission, touched down on Comet 67P to study the origins of the solar system. The 1.4 billion euro mission overcame significant challenges including the comet's weak gravity and extreme distance from Earth.
More comprehensive, natural

Quick Reference

For Production
  • - BART-CNN for news/short docs
  • - LED for papers/reports (16K)
  • - Claude/GPT-4 for best quality
Long Documents
  • - Under 16K: use LED directly
  • - Under 200K: use Claude directly
  • - Longer: chunk + hierarchical
Evaluation
  • - ROUGE for quick comparison
  • - BERTScore for semantic
  • - Human eval for production

Use Cases

  • News summarization
  • Research paper digests
  • Meeting notes
  • Legal document summaries
  • Email tl;dr

Architectural Patterns

Extractive Summarization

Select important sentences from the source.

Pros:
  • + Faithful to source
  • + Fast
  • + No hallucination
Cons:
  • - Less fluent
  • - Can't paraphrase
  • - Fixed to source text

Abstractive Summarization

Generate new text that captures the meaning.

Pros:
  • + Fluent output
  • + Can condense more
  • + Natural reading
Cons:
  • - May hallucinate
  • - Slower
  • - Needs more compute

LLM Summarization

Use large language models with summarization prompts.

Pros:
  • + Handles any format
  • + Controllable style
  • + Long context
Cons:
  • - Expensive
  • - May miss details
  • - Inconsistent

Implementations

API Services

Claude

Anthropic
API

200K context. Excellent for long documents.

GPT-4o

OpenAI
API

128K context. Great instruction following.

Open Source

BART-large-cnn

Apache 2.0
Open Source

Classic news summarization. Fast and reliable.

Pegasus

Apache 2.0
Open Source

Optimized for summarization. Good abstractive quality.

LongT5

Apache 2.0
Open Source

Long document summarization. 16K+ tokens.

Quick Facts

Input: Text
Output: Text
Implementations: 3 open source, 2 API
Patterns: 3 approaches
