
Text Classification

Classify text into predefined categories. Powers spam detection, sentiment analysis, topic categorization, and content moderation.

How Text Classification Works

A technical deep-dive into text classification, from fine-tuned BERT to zero-shot and LLM-based approaches.

1. Classification Types

Three fundamental types based on output structure.

Binary

Two classes (yes/no, spam/not spam).
Examples: Spam detection, sentiment (pos/neg), toxicity
Output: Single probability score

Multi-class

One of N mutually exclusive classes.
Examples: Topic categorization, intent classification, language ID
Output: Softmax over N classes

Multi-label

Multiple labels can be true simultaneously.
Examples: Tag prediction, content warnings, emotion detection
Output: Independent sigmoid per label
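
The difference between these output heads is easiest to see in code. A minimal PyTorch sketch (the logit values and the 0.5 threshold are illustrative assumptions):

import torch

logits = torch.tensor([2.0, 0.5, -1.0])  # raw scores for three categories

# Multi-class: softmax makes the scores compete, so probabilities sum to 1
multiclass_probs = torch.softmax(logits, dim=-1)
predicted_class = multiclass_probs.argmax().item()

# Multi-label: independent sigmoid per label, each thresholded separately
multilabel_probs = torch.sigmoid(logits)
active_labels = (multilabel_probs > 0.5).nonzero().flatten().tolist()

print(multiclass_probs, predicted_class)  # ~[0.79, 0.18, 0.04], class 0
print(multilabel_probs, active_labels)    # ~[0.88, 0.62, 0.27], labels [0, 1]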

Classification Pipeline

Text ("Great product!") -> Tokenize ([CLS] great product [SEP]) -> Encoder (BERT/RoBERTa) -> [CLS] embedding (768-dim vector) -> Linear head (softmax -> class)
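
To make the pipeline concrete, here is a minimal sketch of the encoder stage with Hugging Face transformers (the checkpoint is illustrative, and the linear head is untrained here; in practice it is learned during fine-tuning):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize: [CLS] and [SEP] are added automatically
inputs = tokenizer("Great product!", return_tensors="pt")

# Encode: last_hidden_state has shape (batch, seq_len, 768) for bert-base
with torch.no_grad():
    outputs = encoder(**inputs)
cls_vector = outputs.last_hidden_state[:, 0, :]  # vector at the [CLS] position

# Linear head over the [CLS] vector, then softmax over classes
head = torch.nn.Linear(768, 2)
probs = torch.softmax(head(cls_vector), dim=-1)
print(cls_vector.shape, probs)  # torch.Size([1, 768]) and two class probabilities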
2. Approaches Comparison

Four main approaches with different trade-offs.

Approach          | Accuracy        | Speed             | Data Needed      | Cost
Fine-tuned BERT   | High (92-98%)   | Fast (1-10ms)     | 100-10K examples | Low (self-hosted)
Zero-Shot (NLI)   | Medium (70-85%) | Medium (50-200ms) | 0 examples       | Low (self-hosted)
Few-Shot (SetFit) | High (85-95%)   | Fast (5-20ms)     | 8-16 examples    | Low (self-hosted)
LLM Prompting     | High (85-95%)   | Slow (500-2000ms) | 0-5 examples     | High ($1-10/1K)
Use Fine-tuned BERT when:
  • You have 100+ labeled examples per class
  • Categories are fixed and well-defined
  • You need fast, cheap inference at scale

Use Zero-Shot when:
  • You have no labeled data
  • Categories may change frequently
  • You need quick prototyping / exploration

Use SetFit (Few-Shot) when:
  • You have 8-50 examples per class
  • You need accuracy close to fine-tuning
  • You want fast training (minutes)

Use LLM when:
  • You need explanations with predictions
  • Categories are complex and nuanced
  • Decisions are low volume, high value
3. Model Evolution

From bag-of-words to transformers and LLMs.

Bag of Words + SVM (2000, Traditional): TF-IDF features, linear classifier
Word2Vec + CNN (2014, Neural): Kim CNN, word embeddings
LSTM/BiLSTM (2015, RNN): sequential modeling, attention
BERT (2018, Transformer): bidirectional, pre-trained, fine-tuned
RoBERTa (2019, Transformer): better pre-training, longer training
DistilBERT (2019, Transformer): 60% faster, 97% of BERT's performance
DeBERTa-v3 (2021, Transformer): SOTA on GLUE/SuperGLUE
SetFit (2022, Few-Shot): 8-16 examples, sentence transformers
GPT-4 / Claude (2023, LLM): zero-shot, in-context learning
Best Accuracy: DeBERTa-v3-large (91.7% on SuperGLUE)
Best Speed/Accuracy: DistilBERT (60% faster, 97% of BERT's performance)
Best Few-Shot: SetFit (high accuracy with 8-16 examples)
4. Benchmarks

Standard datasets for evaluating text classifiers.

Dataset    | Task      | Size | Metric   | SOTA
SST-2      | Sentiment | 67K  | Accuracy | 97.5% (DeBERTa)
IMDB       | Sentiment | 50K  | Accuracy | 96.2% (DeBERTa)
AG News    | Topic     | 120K | Accuracy | 95.5% (BERT-large)
MNLI       | NLI       | 433K | Accuracy | 92.0% (DeBERTa)
GoEmotions | Emotion   | 58K  | F1       | 52% (RoBERTa)
GLUE and SuperGLUE

The standard NLU benchmarks. GLUE includes SST-2, MNLI, QQP, and more; SuperGLUE adds harder tasks. DeBERTa-v3 leads both as of 2024. The human baseline on SuperGLUE is 89.8%, and top models now exceed it.
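
To experiment with these datasets, most load directly from the Hugging Face Hub. A minimal sketch for SST-2 (the "glue"/"sst2" names are the standard dataset configs):

from datasets import load_dataset

# SST-2: single sentences with binary sentiment labels
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])  # {'sentence': ..., 'label': 0 or 1, 'idx': 0}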

5. Code Examples

Get started with text classification in Python.

Pre-trained BERT (recommended start)
pip install transformers
from transformers import pipeline

# Load pre-trained sentiment classifier
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Classify text
texts = [
    "I love this product! It exceeded my expectations.",
    "Terrible experience. Would not recommend.",
    "It's okay, nothing special."
]

results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{result['label']}: {result['score']:.3f} - {text[:50]}...")

# Output:
# POSITIVE: 0.999 - I love this product! It exceeded my expectations.
# NEGATIVE: 0.999 - Terrible experience. Would not recommend.
# NEGATIVE: 0.796 - It's okay, nothing special.

Quick Reference

For Production
  • Fine-tune DistilBERT, or DeBERTa-v3 for best accuracy
  • SetFit if data is limited
For Prototyping
  • Zero-shot (BART-MNLI) or LLM prompting
  • Validate before training
Key Metrics (see the sketch below)
  • Accuracy (balanced classes)
  • F1 (imbalanced classes)
  • AUC-ROC (ranking)
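
A minimal sketch of computing these metrics with scikit-learn (the labels and scores below are toy data):

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # gold labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]   # predicted P(class=1)

print("Accuracy:", accuracy_score(y_true, y_pred))   # fine for balanced classes
print("F1:", f1_score(y_true, y_pred))               # better for imbalanced classes
print("AUC-ROC:", roc_auc_score(y_true, y_score))    # ranking quality, needs scores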

Use Cases

  • Sentiment analysis
  • Spam detection
  • Topic categorization
  • Content moderation
  • Intent classification

Architectural Patterns

Fine-tuned Encoder

Fine-tune BERT/RoBERTa on labeled data.

Pros:
  • High accuracy
  • Fast inference
  • Well-understood
Cons:
  • Needs labeled data
  • Fixed categories
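
A condensed fine-tuning sketch with the Hugging Face Trainer (the checkpoint, dataset, subsampling, and hyperparameters are illustrative assumptions, not a tuned recipe):

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# Any dataset with "text" and "label" columns works; IMDB is shown here
dataset = load_dataset("imdb")
encoded = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=256),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf-out", num_train_epochs=2,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"].shuffle(seed=42).select(range(2000)),  # subsampled for speed
    eval_dataset=encoded["test"].select(range(500)),
)
trainer.train()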

Zero-Shot Classification

Use NLI models to classify without training examples.

Pros:
  • No training needed
  • Flexible categories
Cons:
  • Lower accuracy
  • Slower inference
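
A minimal zero-shot sketch using an NLI model: each candidate label is scored as an entailment hypothesis against the input. The labels here are illustrative:

from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "The new phone's battery drains within hours.",
    candidate_labels=["battery", "screen", "price", "shipping"],
)
print(result["labels"][0], result["scores"][0])  # top label and its score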

LLM-Based

Use LLMs with prompting for classification.

Pros:
  • Highly flexible
  • Can explain decisions
Cons:
  • Expensive
  • Slower
  • May be inconsistent
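
A minimal prompting sketch using the OpenAI Python client (the model name, labels, and prompt wording are illustrative assumptions; any chat-capable LLM works the same way):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

labels = ["billing", "technical", "account", "other"]
prompt = (
    "Classify the support ticket into exactly one of these categories: "
    f"{', '.join(labels)}. Respond with only the category name.\n\n"
    "Ticket: I was charged twice for my subscription this month."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",        # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
    temperature=0,              # deterministic output reduces inconsistency
)
print(response.choices[0].message.content.strip())  # e.g. "billing"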

Implementations

API Services

OpenAI Moderation (OpenAI, API)

Content moderation API. Fast and free.
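
A minimal call sketch with the OpenAI Python client (requires an OPENAI_API_KEY; the input text is illustrative):

from openai import OpenAI

client = OpenAI()
resp = client.moderations.create(
    input="I will hurt you if you post that again.",
)
result = resp.results[0]
print(result.flagged)      # True/False overall decision
print(result.categories)   # per-category flags (harassment, violence, ...)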

Open Source

SetFit (Apache 2.0, open source)

Few-shot text classification. High accuracy with 8-16 examples.
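
A minimal SetFit sketch (assumes setfit>=1.0; the sentence-transformer backbone and the tiny toy dataset are illustrative):

from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# A handful of labeled examples per class is often enough for SetFit
train_ds = Dataset.from_dict({
    "text": ["Loved it!", "Awful service.", "Great quality.", "Never again."],
    "label": [1, 0, 1, 0],
})

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2")
trainer = Trainer(model=model, args=TrainingArguments(num_epochs=1),
                  train_dataset=train_ds)
trainer.train()

print(model.predict(["This exceeded expectations", "Total waste of money"]))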

DistilBERT (Apache 2.0, open source)

Fast and efficient. Good starting point for fine-tuning.

DeBERTa-v3 (MIT, open source)

Top GLUE performance. Best accuracy for classification.

BART-large-mnli (Apache 2.0, open source)

Zero-shot classification using NLI.


Quick Facts

Input: Text
Output: Structured Data
Implementations: 4 open source, 1 API
Patterns: 3 approaches
