Codesota · Language & Text
Which model, what task, at what cost
Issue: March 2026
§ 00 · Language & text

Text task router

Pick the text output you need: answer, vector, label, entities, translation, or summary. LLM leaderboards are only one slice of the language stack.

Use `/llm` for frontier reasoning, `/benchmarks/mteb` for embeddings, and the task rows below for specialised NLP work.
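
The six outputs named above line up with the six task rows on this page, so the router can be sketched as a simple lookup. This is a minimal illustration, not part of the site: the pairing of each output type with a task is our reading of the page, and `TASK_BY_OUTPUT` and `route` are hypothetical names.

```python
# Output type -> (task, benchmark named on this page).
# The pairing is an assumption based on the task descriptions.
TASK_BY_OUTPUT = {
    "answer": ("Question Answering", "SQuAD, TriviaQA"),
    "vector": ("Text Embeddings", "MTEB"),
    "label": ("Text Classification", "GLUE, SuperGLUE"),
    "entities": ("Named Entity Recognition", "CoNLL-2003"),
    "translation": ("Translation", "WMT"),
    "summary": ("Summarization", "CNN/DailyMail"),
}


def route(output_type: str) -> tuple[str, str]:
    """Return the (task, benchmark) pair for the output a caller needs."""
    key = output_type.strip().lower()
    if key not in TASK_BY_OUTPUT:
        raise ValueError(f"unknown output type: {output_type!r}")
    return TASK_BY_OUTPUT[key]
```

For example, `route("vector")` returns the Text Embeddings row and its MTEB benchmark.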

§ 01 · Text tasks

Not every task needs an LLM.

Six text-processing axes where specialised models still compete — or win outright — on latency, cost, or accuracy at scale.

| Task | Scope | Benchmark | Listed model |
|---|---|---|---|
| Text Embeddings | Semantic search, RAG, clustering | MTEB | KaLM-Gemma3-12B (72.3%) |
| Translation | 33+ languages, document-level | WMT | HY-MT1.5 (WMT2025 winner) |
| Question Answering | Extractive, abstractive, multi-hop | SQuAD, TriviaQA | GPT-5 / Claude 4 |
| Named Entity Recognition | People, orgs, locations, custom | CoNLL-2003 | Fine-tuned DeBERTa v3 |
| Text Classification | Sentiment, intent, topic | GLUE, SuperGLUE | DeBERTa v3 (GLUE 91.3) |
| Summarization | News, documents, conversations | CNN/DailyMail | Claude 4 / GPT-5 |

§ 02 · Decision

LLM, or specialised model?

Use an LLM when
  • Few examples available (few-shot)
  • Complex, nuanced task definitions
  • You need to explain reasoning
  • The task evolves frequently
  • Low volume (< 10K requests/day)
Use a specialised model when
  • High volume (> 100K requests/day)
  • Latency critical (< 100 ms)
  • Cost sensitive (pennies per 1K calls)
  • Well-defined, stable task
  • Training data available
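
The two checklists above can be turned into a rough scoring heuristic. The sketch below is one possible reading, not a method the page prescribes: the thresholds (10K and 100K requests/day, 100 ms) come from the lists, while the equal-weight vote, the `Workload` fields, and the `recommend` function are assumptions made for illustration. The "complex, nuanced task definitions" criterion is left out because it is hard to express as a single flag.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_day: int
    latency_budget_ms: float
    cost_sensitive: bool
    task_is_stable: bool
    has_training_data: bool
    needs_explanations: bool


def recommend(w: Workload) -> str:
    """Return "specialised", "llm", or "either" by counting checklist hits.

    Each criterion counts as one vote; equal weighting is an assumption.
    """
    specialised_votes = sum([
        w.requests_per_day > 100_000,   # high volume
        w.latency_budget_ms < 100,      # latency critical
        w.cost_sensitive,               # pennies per 1K calls
        w.task_is_stable,               # well-defined, stable task
        w.has_training_data,            # training data available
    ])
    llm_votes = sum([
        w.requests_per_day < 10_000,    # low volume
        not w.has_training_data,        # few examples, so few-shot
        not w.task_is_stable,           # task evolves frequently
        w.needs_explanations,           # reasoning must be explained
    ])
    if specialised_votes > llm_votes:
        return "specialised"
    if llm_votes > specialised_votes:
        return "llm"
    return "either"
```

Workloads between 10K and 100K requests/day trigger neither volume rule, so the remaining criteria decide; a tie returns "either" rather than forcing a choice.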
§ 03 · Keep reading

Go deeper.

Verified benchmarks across every text task. Submit new SOTA results or suggest benchmarks we should be tracking.

Frontier leaderboard · MTEB embedding benchmark · All NLP benchmarks