Level 1: Single Blocks | ~15 min

Text Classification

Assign labels to text automatically. From sentiment analysis to intent detection, text classification is the foundation of many NLP pipelines.

What is Text Classification?

Text classification assigns predefined categories to text. Given an input text, the model outputs a label (or multiple labels) with confidence scores.

This is one of the most practical NLP tasks - it powers spam filters, customer support routing, content moderation, and sentiment dashboards.
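To make "a label with confidence scores" concrete, here is a minimal sketch of how raw model scores (logits) are commonly turned into probabilities with a softmax. The logit values and the helper names `softmax` and `top_label` are illustrative assumptions, not output from any particular model.

```python
import math

def softmax(logits):
    """Convert raw model scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(labels, logits):
    """Return the most probable label and its confidence score."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]

# Hypothetical logits for one input text (illustrative values only)
labels = ["positive", "negative", "neutral"]
label, confidence = top_label(labels, [4.0, 0.5, 1.0])
print(label, round(confidence, 2))
```

The confidence is simply the probability assigned to the winning label, which is why it always falls between 0 and 1.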

Sentiment Analysis

Classify text as positive, negative, or neutral. Used for product reviews, social media monitoring, brand tracking.

"Love this product!" -> positive (0.98)

Topic Classification

Categorize documents by subject. News articles, support tickets, research papers.

"Fed raises rates..." -> finance (0.92)

Intent Detection

Understand user goals in conversational AI. What does the user want to do?

"Cancel my order" -> cancel_order (0.95)

Spam Detection

Filter unwanted content. Email spam, comment moderation, abuse detection.

"You won $1M..." -> spam (0.99)

Two Approaches: Zero-shot vs Fine-tuned

You have two main options for text classification. The choice depends on your use case, accuracy requirements, and available data.

Zero-shot Classification

  • + No training data needed
  • + Works immediately with any labels
  • + Easy to change categories on the fly
  • - Lower accuracy than a fine-tuned model
  • - Higher latency (large-model or LLM inference)

Best for: Prototyping, dynamic categories, low-volume

Fine-tuned Classifiers

  • + Highest accuracy for your domain
  • + Fast inference (milliseconds)
  • + Works offline, no API costs
  • - Requires labeled training data
  • - Retraining needed for new categories

Best for: Production, high-volume, fixed categories

Zero-shot Classification

Zero-shot classification uses a model trained on natural language inference (NLI) to classify text into any categories you provide - no training required.

Zero-shot with BART-MNLI

Hugging Face
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

result = classifier(
    "I love this product! Best purchase ever.",
    candidate_labels=["positive", "negative", "neutral"]
)

print(result)
# {'sequence': 'I love this product! Best purchase ever.',
#  'labels': ['positive', 'negative', 'neutral'],
#  'scores': [0.9845, 0.0098, 0.0057]}

Zero-shot with OpenAI

GPT-4
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": """Classify this text as positive/negative/neutral.
Respond with only the label.

Text: "The service was okay, nothing special."
Classification:"""
    }]
)

print(response.choices[0].message.content)
# neutral
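
The prompt asks for only the label, but a chat model can still reply with different capitalization, trailing punctuation, or extra words. A minimal sketch of a guard that normalizes the reply before it is used downstream; `parse_label`, `ALLOWED`, and the `"unknown"` fallback are hypothetical names for illustration and are not part of the OpenAI SDK.

```python
ALLOWED = ["positive", "negative", "neutral"]  # assumed label set

def parse_label(raw_response, allowed_labels, fallback="unknown"):
    """Map a free-text LLM reply onto one of the allowed labels."""
    # Trim whitespace, then surrounding quotes and periods, then lowercase
    cleaned = raw_response.strip().strip(".\"'").lower()
    for label in allowed_labels:
        if cleaned == label.lower():
            return label
    # Anything that is not an exact label match is treated as unusable
    return fallback

print(parse_label(" Neutral.\n", ALLOWED))
print(parse_label("It is mostly positive", ALLOWED))
```

Rejecting anything that is not an exact match keeps unexpected replies out of dashboards and routing rules; a retry or human review can handle the fallback case.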

How Zero-shot Works

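BART-MNLI was trained on natural language inference: deciding whether a hypothesis follows from a premise. For classification, the input text is the premise and each candidate label is turned into a hypothesis (the pipeline's default template is "This example is {}."), so the model checks: "Does 'I love this product' entail 'This example is positive'?" The entailment scores, normalized across the candidate labels, become the classification confidences.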
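
The mechanism can be sketched without loading a model: build one hypothesis per label, get an entailment logit for each (premise, hypothesis) pair, and softmax those logits across the labels. The logits below are hypothetical stand-ins for model output, the function names are illustrative, and the template string is assumed to match the Hugging Face pipeline's default `hypothesis_template`.

```python
import math

def build_hypotheses(candidate_labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in candidate_labels]

def rank_by_entailment(candidate_labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(candidate_labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

labels = ["positive", "negative", "neutral"]
print(build_hypotheses(labels))

# Hypothetical entailment logits an NLI model might return, one per hypothesis
ranked = rank_by_entailment(labels, [3.2, -2.1, -1.4])
print(ranked)
```

Because every label costs one premise/hypothesis forward pass, latency grows with the number of candidate labels.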

Fine-tuned Classifiers

For production systems with high accuracy requirements, use models fine-tuned specifically for classification. These are fast, accurate, and cost-effective at scale.

Sentiment Analysis with DistilBERT

SST-2 Fine-tuned
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

result = sentiment("This movie was terrible.")
print(result)
# [{'label': 'NEGATIVE', 'score': 0.9998}]

# Batch processing
texts = [
    "I love this!",
    "This is awful.",
    "It's okay I guess."
]
results = sentiment(texts)
for text, res in zip(texts, results):
    print(f"{text} -> {res['label']} ({res['score']:.2f})")

Topic Classification with RoBERTa

Multi-class
from transformers import pipeline

# Topic classification (model fine-tuned on tweets)
topic_classifier = pipeline(
    "text-classification",
    model="cardiffnlp/tweet-topic-21-single"
)

result = topic_classifier("Bitcoin hits new all-time high as investors pile in")
print(result)
# Example output shape (exact label and score depend on the model):
# [{'label': 'business_&_entrepreneurs', 'score': 0.87}]

Popular Pre-trained Models

Model | Task | Speed
distilbert-base-uncased-finetuned-sst-2-english | Sentiment (2-class) | Very Fast
cardiffnlp/twitter-roberta-base-sentiment | Sentiment (3-class) | Fast
facebook/bart-large-mnli | Zero-shot (any labels) | Medium
MoritzLaurer/DeBERTa-v3-large-mnli | Zero-shot (SOTA) | Slow

Multi-class vs Multi-label

Multi-class

Exactly one label per text. Labels are mutually exclusive.

Example: Sentiment

Text -> [positive OR negative OR neutral]

Multi-label

Zero or more labels per text. Labels are independent.

Example: Article Tags

Text -> [tech AND finance AND breaking]
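
In model terms, the difference usually comes down to the output activation: multi-class heads apply a softmax so the labels compete, while multi-label heads apply a sigmoid to each label independently. A minimal sketch with hypothetical logits; the function names and the 0.5 cutoff are illustrative assumptions.

```python
import math

def softmax(logits):
    """Probabilities that sum to 1, so labels compete with each other."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Independent probability for a single label."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_multi_class(labels, logits):
    """Exactly one label: the highest softmax probability wins."""
    probs = softmax(logits)
    return labels[probs.index(max(probs))]

def predict_multi_label(labels, logits, threshold=0.5):
    """Zero or more labels: each sigmoid score is thresholded on its own."""
    return [label for label, z in zip(labels, logits) if sigmoid(z) >= threshold]

tags = ["tech", "finance", "breaking"]
logits = [2.0, 1.5, -3.0]  # hypothetical model outputs

print(predict_multi_class(tags, logits))   # tech
print(predict_multi_label(tags, logits))   # ['tech', 'finance']
```

With all-negative logits the multi-label function returns an empty list, which is the "zero labels" case that a softmax classifier cannot express.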

Multi-label Classification

Zero-shot
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

result = classifier(
    "Apple announces new AI chip for data centers",
    candidate_labels=["technology", "business", "science", "politics"],
    multi_label=True  # Key parameter!
)

print(result)
# {'labels': ['technology', 'business', 'science', 'politics'],
#  'scores': [0.95, 0.82, 0.23, 0.02]}
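
With `multi_label=True` the scores are independent and do not sum to 1, so the final tags come from a cutoff rather than from picking the top label. A small sketch that filters the result dictionary; `select_labels` and its `per_label` override are hypothetical helpers, not part of the transformers API.

```python
def select_labels(result, default_threshold=0.5, per_label=None):
    """Keep every label whose score meets its threshold."""
    per_label = per_label or {}
    return [
        label
        for label, score in zip(result["labels"], result["scores"])
        if score >= per_label.get(label, default_threshold)
    ]

# Result dictionary copied from the example output above
result = {
    "labels": ["technology", "business", "science", "politics"],
    "scores": [0.95, 0.82, 0.23, 0.02],
}

print(select_labels(result))                               # ['technology', 'business']
print(select_labels(result, per_label={"business": 0.9}))  # ['technology']
```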

Confidence Scores and Thresholds

Classification models output probability scores. Setting appropriate thresholds is crucial for production systems.

Threshold Strategy

  • High threshold (0.9+): High precision, fewer false positives. Use for automated actions (auto-delete spam, auto-route tickets).
  • Medium threshold (0.7-0.9): Balanced. Use for suggestions that humans review.
  • Low threshold (0.5-0.7): High recall, catches more. Use for flagging content for review.
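
One way to combine the three bands in a single system is a tiered router. This is a sketch under assumptions: the tier names are hypothetical, and treating scores below 0.5 as "no action" is a choice the bands above do not specify.

```python
def route_prediction(score):
    """Map a confidence score to an action tier."""
    if score >= 0.9:
        return "automate"         # high precision: safe to act without review
    if score >= 0.7:
        return "suggest"          # balanced: show as a suggestion to a human
    if score >= 0.5:
        return "flag_for_review"  # high recall: queue for manual inspection
    return "no_action"            # assumed behavior below the lowest band

for score in (0.99, 0.82, 0.55, 0.20):
    print(score, "->", route_prediction(score))
```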

Applying Thresholds

Python
from transformers import pipeline

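# Pin the model explicitly instead of relying on the pipeline default
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)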

def classify_with_threshold(text, threshold=0.85):
    result = classifier(text)[0]

    if result['score'] >= threshold:
        return result['label'], result['score']
    else:
        return 'UNCERTAIN', result['score']

# High confidence
print(classify_with_threshold("I love this!"))
# ('POSITIVE', 0.9998)

# Low confidence - needs human review
print(classify_with_threshold("It's fine I guess."))
# ('UNCERTAIN', 0.7234)
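
A threshold is best chosen by measuring precision and recall on labeled examples rather than by guessing. The sketch below uses a tiny hand-made validation set (hypothetical scores and labels) to show the trade-off; returning 1.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def precision_recall(scores, is_positive, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    pairs = list(zip(scores, is_positive))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Hypothetical spam scores with hand-assigned ground truth
scores = [0.99, 0.95, 0.80, 0.65, 0.55, 0.30]
is_spam = [True, True, False, True, False, False]

for threshold in (0.5, 0.7, 0.9):
    p, r = precision_recall(scores, is_spam, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.9 lifts precision from 0.60 to 1.00 while recall drops from 1.00 to 0.67.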

Benchmark: GLUE and SST-2

The Stanford Sentiment Treebank (SST-2) is a standard benchmark for binary sentiment classification. It is one of the tasks in the GLUE benchmark suite.

SST-2 Accuracy

RoBERTa-large: 96.4%
GPT-4 (zero-shot): 95%
DeBERTa-v3-base: 94.8%
BERT-large: 93.5%
DistilBERT: 91.3%
BART-MNLI (zero-shot): 88%

SST-2 binary sentiment classification accuracy. Higher is better. Human baseline is ~97%.

Key Insight

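The best fine-tuned model (RoBERTa-large, 96.4%) beats zero-shot on this benchmark, but GPT-4 zero-shot comes surprisingly close (~95%) without any training and outscores the smaller fine-tuned models listed. For production, fine-tuned DistilBERT gives you 91.3% accuracy at roughly 10x lower latency and cost than calling GPT-4.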

When to Use What

Prototyping / Exploring Categories

Use BART-MNLI zero-shot. Test different label sets instantly.

No training needed | Change labels anytime | ~200ms per text

Production Sentiment Analysis

Use DistilBERT fine-tuned on SST-2. Fast and accurate.

91.3% accuracy | 10ms inference | Run locally for free

Highest Zero-shot Accuracy (Budget Available)

Use GPT-4 with structured outputs. Best reasoning for edge cases.

~95% accuracy | Handles nuance | Good for ambiguous text

Custom Categories (Have Training Data)

Fine-tune DistilBERT or RoBERTa on your labeled data.

Best accuracy for your domain | Fast inference | Works offline

Multi-label Tagging

Use BART-MNLI with multi_label=True or fine-tune with sigmoid outputs.

Multiple tags per text | Set per-label thresholds | Flexible categories

Key Takeaways

  1. Two main approaches: Zero-shot for flexibility (no training), fine-tuned for accuracy and speed.
  2. DistilBERT is the workhorse: 91.3% SST-2 accuracy, millisecond inference, runs anywhere.
  3. BART-MNLI for zero-shot: Classify into any labels without training. Great for prototyping.
  4. Set appropriate thresholds: High for automation, low for flagging content for human review.

Practice Exercise

Try building a simple classifier:

  1. Install transformers: pip install transformers torch
  2. Run the DistilBERT sentiment example on 10 product reviews.
  3. Try zero-shot classification with your own custom labels (e.g., "urgent", "question", "complaint").
  4. Compare the speed difference between zero-shot and fine-tuned models.
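
For the speed comparison in the last step, a small timing harness is enough. This is a sketch: `mean_latency_ms` and `dummy_classifier` are hypothetical helpers, and the dummy stands in for a real pipeline so the code runs without downloading a model.

```python
import time

def mean_latency_ms(classify, texts, warmup=1):
    """Average wall-clock milliseconds per text for a classifier callable."""
    for text in texts[:warmup]:
        classify(text)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for text in texts:
        classify(text)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / len(texts)

# Stand-in classifier; swap in a real call such as `lambda t: sentiment(t)`
def dummy_classifier(text):
    label = "POSITIVE" if "love" in text else "NEGATIVE"
    return {"label": label, "score": 1.0}

reviews = ["I love this!", "This is awful.", "It's okay I guess."]
print(f"{mean_latency_ms(dummy_classifier, reviews):.3f} ms per text")
```

Run the same function once with the zero-shot pipeline and once with the fine-tuned one, on the same texts and hardware, to get comparable numbers.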
