Text Classification
Assign labels to text automatically. From sentiment analysis to intent detection - the foundation of many NLP pipelines.
What is Text Classification?
Text classification assigns predefined categories to text. Given an input text, the model outputs a label (or multiple labels) with confidence scores.
This is one of the most practical NLP tasks - it powers spam filters, customer support routing, content moderation, and sentiment dashboards.
Sentiment Analysis
Classify text as positive, negative, or neutral. Used for product reviews, social media monitoring, brand tracking.
Topic Classification
Categorize documents by subject. News articles, support tickets, research papers.
Intent Detection
Understand user goals in conversational AI. What does the user want to do?
Spam Detection
Filter unwanted content. Email spam, comment moderation, abuse detection.
Two Approaches: Zero-shot vs Fine-tuned
You have two main options for text classification. The choice depends on your use case, accuracy requirements, and available data.
Zero-shot Classification
- Pro: No training data needed
- Pro: Works immediately with any labels
- Pro: Easy to change categories on the fly
- Con: Lower accuracy than fine-tuned
- Con: Higher latency (one model pass per candidate label, or an LLM API call)
Best for: Prototyping, dynamic categories, low-volume
Fine-tuned Classifiers
- Pro: Highest accuracy for your domain
- Pro: Fast inference (milliseconds)
- Pro: Works offline, no API costs
- Con: Requires labeled training data
- Con: Retraining needed for new categories
Best for: Production, high-volume, fixed categories
Zero-shot Classification
Zero-shot classification uses a model trained on natural language inference (NLI) to classify text into any categories you provide - no training required.
Zero-shot with BART-MNLI
Hugging Face
from transformers import pipeline
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)
result = classifier(
    "I love this product! Best purchase ever.",
    candidate_labels=["positive", "negative", "neutral"]
)
print(result)
# {'sequence': 'I love this product! Best purchase ever.',
#  'labels': ['positive', 'negative', 'neutral'],
#  'scores': [0.9845, 0.0098, 0.0057]}
Zero-shot with OpenAI
GPT-4
from openai import OpenAI
# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": """Classify this text as positive/negative/neutral.
Respond with only the label.
Text: "The service was okay, nothing special."
Classification:"""
    }]
)
print(response.choices[0].message.content)
# Example output: neutral
How Zero-shot Works
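BART-MNLI was fine-tuned on natural language inference (NLI): deciding whether a hypothesis follows from a premise. For classification, the input text is the premise and each candidate label is turned into a hypothesis, so the model checks: does "I love this product" entail "This example is positive"? (The Hugging Face pipeline's default hypothesis template is "This example is {}.") The entailment score for each label becomes its classification confidence.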
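The step from entailment scores to label confidences can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: it assumes single-label mode normalizes the per-label entailment logits with a softmax, and the logits below are made-up values (no model is loaded). The helper name `zero_shot_scores` is hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_scores(entailment_logits):
    """Turn one entailment logit per label into scores that sum to 1.

    entailment_logits maps each candidate label to the entailment logit
    of the hypothesis built from that label (hypothetical values here).
    """
    labels = list(entailment_logits)
    probs = softmax([entailment_logits[label] for label in labels])
    # Sort from most to least likely, mirroring the shape of the pipeline output.
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return {
        "labels": [label for label, _ in ranked],
        "scores": [prob for _, prob in ranked],
    }


# Made-up entailment logits for the premise "I love this product!"
logits = {"negative": -2.1, "positive": 3.4, "neutral": -1.5}
result = zero_shot_scores(logits)
print(result["labels"])  # ['positive', 'neutral', 'negative']
```

Because the scores are normalized across labels, adding or removing a candidate label changes every other label's score.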
Fine-tuned Classifiers
For production systems with high accuracy requirements, use models fine-tuned specifically for classification. These are fast, accurate, and cost-effective at scale.
Sentiment Analysis with DistilBERT
SST-2 Fine-tuned
from transformers import pipeline
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
result = sentiment("This movie was terrible.")
print(result)
# [{'label': 'NEGATIVE', 'score': 0.9998}]
# Batch processing
texts = [
    "I love this!",
    "This is awful.",
    "It's okay I guess."
]
results = sentiment(texts)
for text, res in zip(texts, results):
    print(f"{text} -> {res['label']} ({res['score']:.2f})")
Topic Classification with RoBERTa
Multi-class
from transformers import pipeline
# Topic classification with a RoBERTa model trained on tweets
topic_classifier = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-topic-single"
)
result = topic_classifier("Bitcoin hits new all-time high as investors pile in")
print(result)
# Example output (label names depend on the model):
# [{'label': 'business_&_finance', 'score': 0.87}]
Popular Pre-trained Models
| Model | Task | Speed |
|---|---|---|
| distilbert-base-uncased-finetuned-sst-2-english | Sentiment (2-class) | Very Fast |
| cardiffnlp/twitter-roberta-base-sentiment | Sentiment (3-class) | Fast |
| facebook/bart-large-mnli | Zero-shot any labels | Medium |
| MoritzLaurer/DeBERTa-v3-large-mnli | Zero-shot (SOTA) | Slow |
Multi-class vs Multi-label
Multi-class
Exactly one label per text. Labels are mutually exclusive.
Example: Sentiment
Text -> [positive OR negative OR neutral]
Multi-label
Zero or more labels per text. Labels are independent.
Example: Article Tags
Text -> [tech AND finance AND breaking]
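The difference between the two settings comes down to how raw model outputs (logits) are turned into predictions. The sketch below assumes the usual convention: softmax plus argmax for multi-class, an independent sigmoid per label plus a threshold for multi-label. The logits, label names, and function names are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax: the probabilities compete and sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Independent probability for a single label."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_multi_class(labels, logits):
    """Exactly one label: pick the label with the highest softmax probability."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best]


def predict_multi_label(labels, logits, threshold=0.5):
    """Zero or more labels: keep every label whose sigmoid score passes the threshold."""
    return [label for label, x in zip(labels, logits) if sigmoid(x) >= threshold]


sentiment = predict_multi_class(["positive", "negative", "neutral"], [2.0, -1.0, 0.5])
tags = predict_multi_label(["tech", "finance", "breaking", "sports"], [3.0, 1.2, 0.4, -2.5])
print(sentiment)  # positive
print(tags)       # ['tech', 'finance', 'breaking']
```

In the multi-label case the result can legitimately be an empty list, which is why downstream code should not assume at least one tag.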
Multi-label Classification
Zero-shot
from transformers import pipeline
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)
result = classifier(
    "Apple announces new AI chip for data centers",
    candidate_labels=["technology", "business", "science", "politics"],
    multi_label=True  # Key parameter!
)
print(result)
# {'labels': ['technology', 'business', 'science', 'politics'],
#  'scores': [0.95, 0.82, 0.23, 0.02]}
Confidence Scores and Thresholds
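Classification models output a score between 0 and 1 for each label. The threshold you apply to that score sets the trade-off between precision and recall, so choosing it deliberately is crucial for production systems.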
Threshold Strategy
- High threshold (0.9+): High precision, fewer false positives. Use for automated actions (auto-delete spam, auto-route tickets).
- Medium threshold (0.7-0.9): Balanced. Use for suggestions that humans review.
- Low threshold (0.5-0.7): High recall, catches more. Use for flagging content for review.
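One way to combine the three bands into a single routing policy is sketched below. The function name `route`, the action names, and the idea of applying all three cut-offs to the same score are assumptions for illustration. The cut-off values come from the list above and should be tuned on your own validation data.

```python
def route(score, auto_threshold=0.9, suggest_threshold=0.7, flag_threshold=0.5):
    """Map a confidence score to an action, from strictest band to loosest."""
    if score >= auto_threshold:
        return "automate"          # high precision: safe to act without a human
    if score >= suggest_threshold:
        return "suggest"           # balanced: show as a suggestion for review
    if score >= flag_threshold:
        return "flag_for_review"   # high recall: queue for a human to inspect
    return "ignore"                # too uncertain to act on


for score in (0.97, 0.82, 0.55, 0.30):
    print(score, "->", route(score))
# 0.97 -> automate
# 0.82 -> suggest
# 0.55 -> flag_for_review
# 0.3 -> ignore
```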
Applying Thresholds
Python
from transformers import pipeline
# Pin the model explicitly instead of relying on the pipeline default
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
def classify_with_threshold(text, threshold=0.85):
    result = classifier(text)[0]
    if result['score'] >= threshold:
        return result['label'], result['score']
    else:
        return 'UNCERTAIN', result['score']
# High confidence
print(classify_with_threshold("I love this!"))
# ('POSITIVE', 0.9998)
# Low confidence - needs human review
print(classify_with_threshold("It's fine I guess."))
# ('UNCERTAIN', 0.7234)
Benchmark: GLUE and SST-2
The Stanford Sentiment Treebank (SST-2) is the standard benchmark for binary sentiment classification. It is one of the tasks in the GLUE benchmark suite.
SST-2 Accuracy
SST-2 binary sentiment classification accuracy. Higher is better. Human baseline is ~97%.
Key Insight
The best fine-tuned models still beat zero-shot on benchmarks, but GPT-4 zero-shot comes surprisingly close (~95%) without any training. For production, fine-tuned DistilBERT gives you 91.3% accuracy at roughly 10x lower latency and cost.
When to Use What
Prototyping / Exploring Categories
Use BART-MNLI zero-shot. Test different label sets instantly.
No training needed | Change labels anytime | ~200ms per text
Production Sentiment Analysis
Use DistilBERT fine-tuned on SST-2. Fast and accurate.
91.3% accuracy | 10ms inference | Run locally for free
Highest Accuracy (Budget Available)
Use GPT-4 with structured outputs. Best reasoning for edge cases.
~95% accuracy | Handles nuance | Good for ambiguous text
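Whatever output format you request from an LLM, it helps to validate the reply before trusting it. The sketch below is not the OpenAI structured-outputs API. It is a plain post-processing step, with hypothetical names (`ALLOWED_LABELS`, `parse_label`), that normalizes a free-text reply and falls back to a sentinel when the reply is not one of the expected labels.

```python
ALLOWED_LABELS = ("positive", "negative", "neutral")


def parse_label(raw_response, allowed=ALLOWED_LABELS, fallback="UNCERTAIN"):
    """Normalize an LLM reply and accept it only if it is an allowed label."""
    # Drop surrounding whitespace, then stray quotes or a trailing period.
    cleaned = raw_response.strip().strip(".\"'").lower()
    return cleaned if cleaned in allowed else fallback


print(parse_label("Neutral."))                      # neutral
print(parse_label(" 'Positive' \n"))                # positive
print(parse_label("I think it's mostly positive"))  # UNCERTAIN
```

Replies that fall back to `UNCERTAIN` can be retried or sent to human review rather than silently stored as a label.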
Custom Categories (Have Training Data)
Fine-tune DistilBERT or RoBERTa on your labeled data.
Best accuracy for your domain | Fast inference | Works offline
Multi-label Tagging
Use BART-MNLI with multi_label=True or fine-tune with sigmoid outputs.
Multiple tags per text | Set per-label thresholds | Flexible categories
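Per-label thresholds can be applied directly to a result shaped like the zero-shot pipeline's output (a dict with `labels` and `scores`). This is a minimal sketch: the function name `select_tags`, the threshold values, and the default of 0.5 are illustrative assumptions, and the scores reuse the example output shown earlier.

```python
def select_tags(result, thresholds, default_threshold=0.5):
    """Keep each label whose score meets that label's own threshold."""
    selected = []
    for label, score in zip(result["labels"], result["scores"]):
        # Labels without an explicit threshold use the default.
        if score >= thresholds.get(label, default_threshold):
            selected.append(label)
    return selected


# Same shape as a multi_label=True zero-shot result.
result = {
    "labels": ["technology", "business", "science", "politics"],
    "scores": [0.95, 0.82, 0.23, 0.02],
}
# Stricter for "business", looser for "science" (illustrative values).
thresholds = {"technology": 0.8, "business": 0.9, "science": 0.2}
print(select_tags(result, thresholds))  # ['technology', 'science']
```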
Key Takeaways
1. Two main approaches: Zero-shot for flexibility (no training), fine-tuned for accuracy and speed.
2. DistilBERT is the workhorse: 91.3% SST-2 accuracy, millisecond inference, runs anywhere.
3. BART-MNLI for zero-shot: Classify into any labels without training. Great for prototyping.
4. Set appropriate thresholds: High for automation, low for flagging content for human review.
Practice Exercise
Try building a simple classifier:
1. Install transformers: pip install transformers torch
2. Run the DistilBERT sentiment example on 10 product reviews.
3. Try zero-shot classification with your own custom labels (e.g., "urgent", "question", "complaint").
4. Compare the speed difference between zero-shot and fine-tuned models.
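For the speed comparison in the last step, a small timing helper is enough. The sketch below uses only the standard library. `median_latency_ms` and `fake_classifier` are hypothetical names, and the fake classifier is a stand-in so the snippet runs without downloading a model.

```python
import statistics
import time


def median_latency_ms(classify, texts, warmup=1, repeats=5):
    """Median per-text latency in milliseconds for any callable classifier."""
    # Warm-up calls keep one-time setup cost out of the measurement.
    for _ in range(warmup):
        classify(texts[0])
    timings = []
    for _ in range(repeats):
        for text in texts:
            start = time.perf_counter()
            classify(text)
            timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def fake_classifier(text):
    """Stand-in classifier; replace it with a real pipeline call."""
    return {"label": "POSITIVE" if "love" in text else "NEGATIVE"}


reviews = ["I love this!", "This is awful."]
print(f"{median_latency_ms(fake_classifier, reviews):.4f} ms")
```

To compare real models, pass the fine-tuned pipeline directly and wrap the zero-shot one so it takes a single argument, for example `lambda t: classifier(t, candidate_labels=["urgent", "question", "complaint"])`.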