
Text Classification

Classify text into predefined categories. Powers spam detection, sentiment analysis, topic categorization, and content moderation.

How Text Classification Works

A technical deep-dive into text classification, from fine-tuned BERT to zero-shot and LLM-based approaches.

1. Classification Types

Three fundamental types based on output structure.

Binary

Two classes (yes/no, spam/not spam).
Examples: Spam detection, sentiment (pos/neg), toxicity
Output: Single probability score

Multi-class

One of N mutually exclusive classes.
Examples: Topic categorization, intent classification, language ID
Output: Softmax over N classes

Multi-label

Multiple labels can be true simultaneously.
Examples: Tag prediction, content warnings, emotion detection
Output: Independent sigmoid per label
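
The difference between these output heads is easiest to see in code. A minimal PyTorch sketch (the logit values and the 0.5 threshold are illustrative assumptions):

import torch

logits = torch.tensor([2.0, 0.5, -1.0])  # raw scores for three categories

# Multi-class: softmax makes the scores compete, so probabilities sum to 1
multiclass_probs = torch.softmax(logits, dim=-1)
predicted_class = multiclass_probs.argmax().item()

# Multi-label: independent sigmoid per label, each thresholded separately
multilabel_probs = torch.sigmoid(logits)
active_labels = (multilabel_probs > 0.5).nonzero().flatten().tolist()

print(multiclass_probs, predicted_class)  # ~[0.79, 0.18, 0.04], class 0
print(multilabel_probs, active_labels)    # ~[0.88, 0.62, 0.27], labels [0, 1]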

Classification Pipeline

Text ("Great product!") -> Tokenize ([CLS] great product [SEP]) -> Encoder (BERT/RoBERTa) -> [CLS] embedding (768-dim vector) -> Linear head (softmax -> class)
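
To make the pipeline concrete, here is a minimal sketch of the encoder stage with Hugging Face transformers (the checkpoint is illustrative, and the linear head is untrained here; in practice it is learned during fine-tuning):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize: [CLS] and [SEP] are added automatically
inputs = tokenizer("Great product!", return_tensors="pt")

# Encode: last_hidden_state has shape (batch, seq_len, 768) for bert-base
with torch.no_grad():
    outputs = encoder(**inputs)
cls_vector = outputs.last_hidden_state[:, 0, :]  # vector at the [CLS] position

# Linear head over the [CLS] vector, then softmax over classes
head = torch.nn.Linear(768, 2)
probs = torch.softmax(head(cls_vector), dim=-1)
print(cls_vector.shape, probs)  # torch.Size([1, 768]) and two class probabilities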
2. Approaches Comparison

Four main approaches with different trade-offs.

Approach          | Accuracy        | Speed             | Data Needed      | Cost
Fine-tuned BERT   | High (92-98%)   | Fast (1-10ms)     | 100-10K examples | Low (self-hosted)
Zero-Shot (NLI)   | Medium (70-85%) | Medium (50-200ms) | 0 examples       | Low (self-hosted)
Few-Shot (SetFit) | High (85-95%)   | Fast (5-20ms)     | 8-16 examples    | Low (self-hosted)
LLM Prompting     | High (85-95%)   | Slow (500-2000ms) | 0-5 examples     | High ($1-10/1K)
Use Fine-tuned BERT when:
  • You have 100+ labeled examples per class
  • Categories are fixed and well-defined
  • You need fast, cheap inference at scale

Use Zero-Shot when:
  • You have no labeled data
  • Categories may change frequently
  • You need quick prototyping / exploration

Use SetFit (Few-Shot) when:
  • You have 8-50 examples per class
  • You need accuracy close to fine-tuning
  • You want fast training (minutes)

Use LLM when:
  • You need explanations with predictions
  • Categories are complex and nuanced
  • Decisions are low volume, high value
3. Model Evolution

From bag-of-words to transformers and LLMs.

Bag of Words + SVM (2000, Traditional): TF-IDF features, linear classifier
Word2Vec + CNN (2014, Neural): Kim CNN, word embeddings
LSTM/BiLSTM (2015, RNN): sequential modeling, attention
BERT (2018, Transformer): bidirectional, pre-trained, fine-tuned
RoBERTa (2019, Transformer): better pre-training, longer training
DistilBERT (2019, Transformer): 60% faster, 97% of BERT's performance
DeBERTa-v3 (2021, Transformer): SOTA on GLUE/SuperGLUE
SetFit (2022, Few-Shot): 8-16 examples, sentence transformers
GPT-4 / Claude (2023, LLM): zero-shot, in-context learning
Best Accuracy: DeBERTa-v3-large (91.7% on SuperGLUE)
Best Speed/Accuracy: DistilBERT (60% faster, 97% of BERT's performance)
Best Few-Shot: SetFit (high accuracy with 8-16 examples)
4. Benchmarks

Standard datasets for evaluating text classifiers.

Dataset    | Task      | Size | Metric   | SOTA
SST-2      | Sentiment | 67K  | Accuracy | 97.5% (DeBERTa)
IMDB       | Sentiment | 50K  | Accuracy | 96.2% (DeBERTa)
AG News    | Topic     | 120K | Accuracy | 95.5% (BERT-large)
MNLI       | NLI       | 433K | Accuracy | 92.0% (DeBERTa)
GoEmotions | Emotion   | 58K  | F1       | 52% (RoBERTa)
GLUE and SuperGLUE

The standard NLU benchmarks. GLUE includes SST-2, MNLI, QQP, and more; SuperGLUE adds harder tasks. DeBERTa-v3 leads both as of 2024. The human baseline on SuperGLUE is 89.8%, and top models now exceed it.
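
To experiment with these datasets, most load directly from the Hugging Face Hub. A minimal sketch for SST-2 (the "glue"/"sst2" names are the standard dataset configs):

from datasets import load_dataset

# SST-2: single sentences with binary sentiment labels
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])  # {'sentence': ..., 'label': 0 or 1, 'idx': 0}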

5. Code Examples

Get started with text classification in Python.

Pre-trained BERT (recommended start)
pip install transformers
from transformers import pipeline

# Load pre-trained sentiment classifier
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Classify text
texts = [
    "I love this product! It exceeded my expectations.",
    "Terrible experience. Would not recommend.",
    "It's okay, nothing special."
]

results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{result['label']}: {result['score']:.3f} - {text[:50]}...")

# Output:
# POSITIVE: 0.999 - I love this product! It exceeded my expectations.
# NEGATIVE: 0.999 - Terrible experience. Would not recommend.
# NEGATIVE: 0.796 - It's okay, nothing special.

Quick Reference

For Production
  • Fine-tune DistilBERT, or DeBERTa-v3 for best accuracy
  • SetFit if data is limited
For Prototyping
  • Zero-shot (BART-MNLI) or LLM prompting
  • Validate before training
Key Metrics (see the sketch below)
  • Accuracy (balanced classes)
  • F1 (imbalanced classes)
  • AUC-ROC (ranking)
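
A minimal sketch of computing these metrics with scikit-learn (the labels and scores below are toy data):

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # gold labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]   # predicted P(class=1)

print("Accuracy:", accuracy_score(y_true, y_pred))   # fine for balanced classes
print("F1:", f1_score(y_true, y_pred))               # better for imbalanced classes
print("AUC-ROC:", roc_auc_score(y_true, y_score))    # ranking quality, needs scores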

Use Cases

  • Sentiment analysis
  • Spam detection
  • Topic categorization
  • Content moderation
  • Intent classification

Architectural Patterns

Fine-tuned Encoder

Fine-tune BERT/RoBERTa on labeled data.

Pros:
  • High accuracy
  • Fast inference
  • Well-understood
Cons:
  • Needs labeled data
  • Fixed categories
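
A condensed fine-tuning sketch with the Hugging Face Trainer (the checkpoint, dataset, subsampling, and hyperparameters are illustrative assumptions, not a tuned recipe):

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# Any dataset with "text" and "label" columns works; IMDB is shown here
dataset = load_dataset("imdb")
encoded = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=256),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clf-out", num_train_epochs=2,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"].shuffle(seed=42).select(range(2000)),  # subsampled for speed
    eval_dataset=encoded["test"].select(range(500)),
)
trainer.train()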

Zero-Shot Classification

Use NLI models to classify without training examples.

Pros:
  • No training needed
  • Flexible categories
Cons:
  • Lower accuracy
  • Slower inference
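
A minimal zero-shot sketch using an NLI model: each candidate label is scored as an entailment hypothesis against the input. The labels here are illustrative:

from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "The new phone's battery drains within hours.",
    candidate_labels=["battery", "screen", "price", "shipping"],
)
print(result["labels"][0], result["scores"][0])  # top label and its score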

LLM-Based

Use LLMs with prompting for classification.

Pros:
  • Highly flexible
  • Can explain decisions
Cons:
  • Expensive
  • Slower
  • May be inconsistent
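
A minimal prompting sketch using the OpenAI Python client (the model name, labels, and prompt wording are illustrative assumptions; any chat-capable LLM works the same way):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

labels = ["billing", "technical", "account", "other"]
prompt = (
    "Classify the support ticket into exactly one of these categories: "
    f"{', '.join(labels)}. Respond with only the category name.\n\n"
    "Ticket: I was charged twice for my subscription this month."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",        # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
    temperature=0,              # deterministic output reduces inconsistency
)
print(response.choices[0].message.content.strip())  # e.g. "billing"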

Implementations

API Services

OpenAI Moderation (OpenAI, API)

Content moderation API. Fast and free.
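
A minimal call sketch with the OpenAI Python client (requires an OPENAI_API_KEY; the input text is illustrative):

from openai import OpenAI

client = OpenAI()
resp = client.moderations.create(
    input="I will hurt you if you post that again.",
)
result = resp.results[0]
print(result.flagged)      # True/False overall decision
print(result.categories)   # per-category flags (harassment, violence, ...)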

Open Source

SetFit (Apache 2.0, open source)

Few-shot text classification. High accuracy with 8-16 examples.
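
A minimal SetFit sketch (assumes setfit>=1.0; the sentence-transformer backbone and the tiny toy dataset are illustrative):

from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# A handful of labeled examples per class is often enough for SetFit
train_ds = Dataset.from_dict({
    "text": ["Loved it!", "Awful service.", "Great quality.", "Never again."],
    "label": [1, 0, 1, 0],
})

model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2")
trainer = Trainer(model=model, args=TrainingArguments(num_epochs=1),
                  train_dataset=train_ds)
trainer.train()

print(model.predict(["This exceeded expectations", "Total waste of money"]))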

DistilBERT (Apache 2.0, open source)

Fast and efficient. Good starting point for fine-tuning.

DeBERTa-v3 (MIT, open source)

Top GLUE performance. Best accuracy for classification.

BART-large-mnli (Apache 2.0, open source)

Zero-shot classification using NLI.


Quick Facts

Input: Text
Output: Structured Data
Implementations: 4 open source, 1 API
Patterns: 3 approaches
