00 - Text Classification

Text classification task router

Classification turns text into labels: intent, topic, sentiment, risk, moderation category, or routing decision. GLUE and SuperGLUE are historical reference points; your label set and class imbalance are the real test.

Benchmark: GLUE - SuperGLUE - domain eval
Current pick: DeBERTa v3
intent - topic - risk
01 - Explainer

What this task measures.

Text classification maps an input string to one or more labels: sentiment, intent, topic, urgency, moderation category, compliance risk, or routing destination. The important design choice is whether labels are stable enough for a fine-tuned classifier or fluid enough that an LLM policy classifier is easier to maintain.

02 - Benchmarks

Use a benchmark ladder.

One leaderboard rarely captures the task. Use the canonical benchmark for lineage, then add harder or more domain-specific checks before choosing a model.

| Benchmark | Role | Metric | Caveat |
| --- | --- | --- | --- |
| GLUE | Historical NLU suite | Average task score | Saturated; useful for model lineage more than production classifier selection. |
| SuperGLUE | Harder NLU suite | Average task score | Still a broad language-understanding proxy, not a domain-label benchmark. |
| GoEmotions / SST-2 | Sentiment and emotion | Accuracy / macro F1 | Good for public comparison; label definitions rarely match business policies. |
| Local validation set | Production gate | Macro F1 / AUROC / calibration | Required for imbalance, drift, threshold tuning, and costly minority-class misses. |

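Threshold tuning on a local validation set can be sketched in a few lines. The example below is illustrative only: the scores, labels, and function names are hypothetical, and it assumes a binary "risk" class scored between 0 and 1. It tries each observed score as a cut-off and keeps the one with the best F1 on the positive class.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 for the positive class when predicting positive if score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, labels):
    """Try each distinct score as a cut-off and keep the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Hypothetical validation data: model scores for a rare "risk" class (1 = risk).
val_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]
val_labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

threshold = best_threshold(val_scores, val_labels)
default_f1 = f1_at_threshold(val_scores, val_labels, 0.5)
tuned_f1 = f1_at_threshold(val_scores, val_labels, threshold)
print(f"default F1 at 0.5: {default_f1:.3f}")
print(f"tuned threshold: {threshold}, F1: {tuned_f1:.3f}")
```

A threshold chosen this way is fitted to the validation set, so its F1 is optimistic; confirming it on a separate held-out split is the safer practice.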
03 - Evaluation

What to compare.

The public benchmark is a shortlist signal. Production choice still depends on latency, cost, domain drift, and how expensive mistakes are.

| Axis | Value | Why it matters |
| --- | --- | --- |
| Historical benchmark | GLUE / SuperGLUE | Useful for lineage, but saturated for frontier model selection. |
| Production metric | F1 / AUROC / calibration | Accuracy hides minority-class misses and bad confidence estimates. |
| Model families | DeBERTa, SetFit, zero-shot NLI, LLMs | Pick by label stability, data volume, and explanation needs. |
| Failure mode | Label drift | Support tickets, policy categories, and abuse labels change over time. |

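To make the point about accuracy hiding minority-class misses concrete, here is a minimal sketch on made-up data. The label names and the 95/5 split are assumptions chosen for illustration; the classifier simply predicts the majority class every time.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one label, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced set: 95 "ok" tickets, 5 "abuse" tickets.
y_true = ["ok"] * 95 + ["abuse"] * 5
# A degenerate classifier that never flags abuse.
y_pred = ["ok"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy: {acc:.2f}, macro F1: {mf1:.2f}")
```

Accuracy comes out at 0.95 while macro F1 is below 0.5, because the "abuse" class contributes an F1 of zero to the unweighted average.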
04 - Routing

Pick by task shape.

Stable labels, high volume

Fine-tuned encoder

Best latency and cost once you have labeled examples.

Few examples

SetFit or small fine-tune

Works well when each class has only a handful of samples.

No training data

Zero-shot NLI

Good for first-pass labeling before annotation exists.

Explain decision

LLM classifier

Use when rationale and flexible policy language matter.
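The four routes above can be written as a small decision function. This is a sketch under stated assumptions, not a fixed rule: `TaskShape`, `pick_model_family`, the `FEW_SHOT_CUTOFF` value, and the order in which the checks run are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Hypothetical cut-off: below this many labeled examples per class, treat the
# task as few-shot. The page only says "a handful"; tune this for your data.
FEW_SHOT_CUTOFF = 32


@dataclass
class TaskShape:
    labels_stable: bool      # do label definitions change rarely?
    examples_per_class: int  # labeled examples available per class
    needs_rationale: bool    # must each decision come with an explanation?


def pick_model_family(task: TaskShape) -> str:
    """Map a task shape to one of the four routes (check order is an assumption)."""
    if task.needs_rationale or not task.labels_stable:
        return "LLM classifier"
    if task.examples_per_class == 0:
        return "Zero-shot NLI"
    if task.examples_per_class < FEW_SHOT_CUTOFF:
        return "SetFit or small fine-tune"
    return "Fine-tuned encoder"


# Usage: a high-volume intent router with stable labels and plenty of data.
print(pick_model_family(TaskShape(labels_stable=True, examples_per_class=5000, needs_rationale=False)))
```

Checking the rationale and label-stability conditions first reflects the idea that fluid policies favor an LLM classifier regardless of data volume; a team with different priorities might reorder the checks.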
