Text Classification
Classify text into predefined categories. Powers spam detection, sentiment analysis, topic categorization, and content moderation.
How Text Classification Works
A technical deep dive into text classification, from fine-tuned BERT to zero-shot and LLM-based approaches.
Classification Types
Three fundamental types based on output structure.
- Binary: two classes (yes/no, spam/not spam)
- Multi-class: one of N mutually exclusive classes
- Multi-label: multiple labels can be true simultaneously
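The difference shows up in the output layer. A minimal PyTorch sketch (logit values are illustrative): multi-class uses a softmax so labels compete, multi-label uses independent sigmoids, and binary is the two-class special case.

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5])  # raw scores for 3 candidate labels

# Multi-class: softmax makes the labels compete; probabilities sum to 1
multiclass_probs = torch.softmax(logits, dim=-1)
print(multiclass_probs.sum())  # tensor(1.)

# Multi-label: independent sigmoids, so several labels can exceed 0.5 at once
multilabel_probs = torch.sigmoid(logits)
print((multilabel_probs > 0.5).tolist())  # [True, False, True]

# Binary: a single sigmoid (or equivalently a 2-way softmax)
print(float(torch.sigmoid(torch.tensor(2.0))))  # ~0.88 -> "positive"
```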
Classification Pipeline
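A typical pipeline runs tokenization, transformer encoding, and a classification head, then decodes the logits into a label with a confidence score. A minimal sketch with Hugging Face transformers (the checkpoint name is just an example):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Example checkpoint; any sequence-classification model works the same way
name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# 1. Tokenize: raw text -> input IDs + attention mask
inputs = tokenizer("Great battery life!", return_tensors="pt")

# 2. Encode + classify: the transformer encoder feeds a linear head
with torch.no_grad():
    logits = model(**inputs).logits

# 3. Decode: softmax over logits -> label + confidence
probs = torch.softmax(logits, dim=-1)[0]
label = model.config.id2label[int(probs.argmax())]
print(label, float(probs.max()))
```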
Approaches Comparison
Four main approaches with different trade-offs.
| Approach | Accuracy | Speed | Data Needed | Cost |
|---|---|---|---|---|
| Fine-tuned BERT | High (92-98%) | Fast (1-10ms) | 100-10K examples | Low (self-hosted) |
| Zero-Shot (NLI) | Medium (70-85%) | Medium (50-200ms) | 0 examples | Low (self-hosted) |
| Few-Shot (SetFit) | High (85-95%) | Fast (5-20ms) | 8-16 examples | Low (self-hosted) |
| LLM Prompting | High (85-95%) | Slow (500-2000ms) | 0-5 examples | High ($1-10/1K) |
Choose fine-tuned BERT when:
- You have 100+ labeled examples per class
- Categories are fixed and well-defined
- You need fast, cheap inference at scale

Choose zero-shot when (see the sketch after this list):
- You have no labeled data
- Categories may change frequently
- You want quick prototyping / exploration

Choose few-shot (SetFit) when:
- You have 8-50 examples per class
- You need accuracy close to fine-tuning
- You want fast training (minutes)

Choose LLM prompting when:
- You need explanations with predictions
- Categories are complex and nuanced
- Decisions are low-volume but high-value
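For the zero-shot route, an NLI model scores each candidate label against the text; a minimal sketch (facebook/bart-large-mnli is a common default, and the labels here are illustrative):

```python
from transformers import pipeline

# NLI-based zero-shot classification; no training examples needed
clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = clf(
    "My package never arrived and support won't answer.",
    candidate_labels=["shipping issue", "billing issue", "product question"],
)
print(result["labels"][0], result["scores"][0])  # top label and its score
```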
Model Evolution
From bag-of-words to transformers and LLMs.
Benchmarks
Standard datasets for evaluating text classifiers.
| Dataset | Task | Size | Metric | SOTA |
|---|---|---|---|---|
| SST-2 | Sentiment | 67K | Accuracy | 97.5% (DeBERTa) |
| IMDB | Sentiment | 50K | Accuracy | 96.2% (DeBERTa) |
| AG News | Topic | 120K | Accuracy | 95.5% (BERT-large) |
| MNLI | NLI | 433K | Accuracy | 92.0% (DeBERTa) |
| GoEmotions | Emotion | 58K | F1 | 52% (RoBERTa) |
GLUE and SuperGLUE are the standard NLU benchmarks. GLUE includes SST-2, MNLI, QQP, and more; SuperGLUE adds harder tasks. DeBERTa-v3 leads both as of 2024, and models now exceed the SuperGLUE human baseline of 89.8%.
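Most of these datasets are available on the Hugging Face hub; for example, SST-2 loads as a GLUE subset (a minimal sketch):

```python
from datasets import load_dataset

# SST-2 ships as a subset of the GLUE benchmark
sst2 = load_dataset("glue", "sst2")
print(sst2["train"].num_rows)  # ~67K training sentences
print(sst2["train"][0])        # {'sentence': ..., 'label': 0 or 1, 'idx': 0}
```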
Code Examples
Get started with text classification in Python.
```python
from transformers import pipeline

# Load a pre-trained sentiment classifier
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Classify a batch of texts
texts = [
    "I love this product! It exceeded my expectations.",
    "Terrible experience. Would not recommend.",
    "It's okay, nothing special."
]

results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{result['label']}: {result['score']:.3f} - {text[:50]}...")

# Output:
# POSITIVE: 0.999 - I love this product! It exceeded my expectations.
# NEGATIVE: 0.999 - Terrible experience. Would not recommend.
# NEGATIVE: 0.796 - It's okay, nothing special.
```

Quick Reference
- Have labeled data: fine-tune DistilBERT, or DeBERTa-v3 for best accuracy; use SetFit if data is limited
- No labeled data: use zero-shot (BART-MNLI) or an LLM with prompting; validate before training
- Metrics: accuracy (balanced classes), F1 (imbalanced classes), AUC-ROC (ranking); see the sketch below
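A minimal scikit-learn sketch of the three metrics (toy labels for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8]  # predicted positive-class probabilities

print(accuracy_score(y_true, y_pred))  # balanced classes
print(f1_score(y_true, y_pred))        # imbalanced classes
print(roc_auc_score(y_true, y_prob))   # ranking quality
```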
Use Cases
- ✓ Sentiment analysis
- ✓ Spam detection
- ✓ Topic categorization
- ✓ Content moderation
- ✓ Intent classification
Architectural Patterns
Fine-tuned Encoder
Fine-tune BERT/RoBERTa on labeled data.
Pros: high accuracy, fast inference, well-understood.
Cons: needs labeled data, fixed categories.
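A minimal fine-tuning sketch for this pattern using the transformers Trainer on SST-2; the checkpoint and hyperparameters are illustrative, not tuned:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# Tokenize the GLUE SST-2 splits; batches are padded dynamically at train time
ds = load_dataset("glue", "sst2")
ds = ds.map(lambda batch: tok(batch["sentence"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sst2-distilbert",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    tokenizer=tok,  # enables the default dynamic-padding collator
)
trainer.train()
print(trainer.evaluate())  # eval metrics on the validation split
```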
Zero-Shot Classification
Use NLI models to classify without training examples.
Pros: no training needed, flexible categories.
Cons: lower accuracy, slower inference.
LLM-Based
Use LLMs with prompting for classification.
Pros: highly flexible, can explain decisions.
Cons: expensive, slower, may be inconsistent.
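A minimal prompting sketch with the OpenAI Python SDK; the model name and label set are assumptions, and constraining the reply to the label set keeps parsing simple:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["billing", "shipping", "technical"]  # hypothetical label set
ticket = "My card was charged twice this month."

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; any chat model works
    temperature=0,        # reduce run-to-run inconsistency
    messages=[{
        "role": "user",
        "content": (f"Classify this support ticket as exactly one of "
                    f"{LABELS}.\nTicket: {ticket}\nAnswer with the label only."),
    }],
)
print(resp.choices[0].message.content)  # e.g. "billing"
```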
Implementations
API Services
OpenAI Moderation
OpenAI's content moderation API. Fast and free.
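A minimal sketch of calling it from the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.moderations.create(input="...text to check...")

result = resp.results[0]
print(result.flagged)     # True if any category tripped
print(result.categories)  # per-category boolean flags
```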
Open Source
Quick Facts
- Input: text
- Output: structured data
- Implementations: 4 open source, 1 API
- Patterns: 3 approaches