
§ 00 · Benchmark

MTEB leaderboard:
best embedding models for RAG.

Direct answer: MTEB is the standard public embedding leaderboard for retrieval, classification, clustering, reranking, semantic similarity, and multilingual search. Use it as a shortlist, then choose by latency, dimensions, context, and license. MTEB and MMTEB are the best public shortlist for text embedding models, but RAG needs a narrower question: which model retrieves the right chunks for your corpus, language mix, latency budget, vector database, and license constraints?

56+ datasets · 8 task categories · 72.32 MMTEB ceiling score · 15 models tracked


§ 01 · RAG selection

The practical answer is tiered.

KaLM-Embedding-Gemma3-12B-2511 is a credible ceiling model on the MMTEB table. That does not automatically make it the default model for a production RAG pipeline. Treat the benchmark leader as the upper bound, then compare it against Qwen3 and smaller baselines on your own retrieval set.

Tier	Model	Use when	Caveat
Ceiling / benchmark chase	tencent/KaLM-Embedding-Gemma3-12B-2511	Use when quality experiments, multilingual retrieval, and leaderboard coverage matter more than serving cost.	11.76B parameters, 3840 dims, custom community license, and higher memory/indexing cost.
Premium practical	Qwen/Qwen3-Embedding-8B or Qwen/Qwen3-Embedding-4B	Use when you want a very strong multilingual model with long context and a cleaner deployment story.	Still expensive for high-volume vectorization; compare the 4B and 0.6B variants before defaulting to 8B.
Production baseline	BAAI/bge-m3, intfloat/multilingual-e5-large-instruct, embeddinggemma-300m	Use for first production RAG systems, hybrid search, or teams that value latency and operational simplicity.	May lose benchmark points, but often wins after quantization, reranking, and corpus-specific tuning.

Verdict

Phrase KaLM as a top-tier open multilingual embedding ceiling model, especially for MMTEB-style aggregate and retrieval-heavy comparisons. Avoid calling it “the SOTA embedding model” globally: Qwen3 still wins some subcategories and may be the stronger production default.
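Comparing candidates on your own retrieval set can be as simple as measuring recall@k over a handful of labelled query-to-chunk pairs. The sketch below is a minimal harness under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in (not a real model), and the chunk ids, texts, and queries are invented for illustration.

```python
import math
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a real model: unit-length bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {token: c / norm for token, c in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    # Inputs are unit length, so the dot product equals cosine similarity.
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


def recall_at_k(embed: Callable[[str], Vector], chunks: Dict[str, str],
                labelled_queries: Dict[str, str], k: int) -> float:
    """Fraction of queries whose gold chunk id appears in the top-k results."""
    chunk_vecs = {cid: embed(text) for cid, text in chunks.items()}
    hits = 0
    for query, gold_id in labelled_queries.items():
        query_vec = embed(query)
        ranked = sorted(chunk_vecs,
                        key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
                        reverse=True)
        hits += gold_id in ranked[:k]
    return hits / len(labelled_queries)


# Invented example corpus and labels (illustration only).
chunks = {
    "c1": "refund policy for annual plans",
    "c2": "rotate api keys every ninety days",
    "c3": "export invoices as csv",
}
labelled_queries = {
    "how do i rotate api keys": "c2",
    "annual plans refund": "c1",
    "download invoices csv": "c3",
}
score = recall_at_k(toy_embed, chunks, labelled_queries, k=1)
```

To compare real models, replace `toy_embed` with a thin wrapper around each candidate's encode call and keep the labelled set fixed, so the only variable is the embedding model.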

§ 02 · Origins

Before MTEB, comparison was chaos.

The Massive Text Embedding Benchmark was introduced in 2022 by Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers at Hugging Face. Before MTEB, comparing embedding models was a mess: each paper picked its own tasks, its own datasets, its own evaluation protocol. MTEB brought order.

The benchmark evaluates embeddings across eight distinct task categories spanning retrieval, classification, clustering, semantic similarity, reranking, pair classification, summarisation, and bitext mining. That breadth is the point: a model that dominates retrieval can still fail at clustering. MTEB catches that.

Since its release, MTEB has become the standard evaluation suite for the embedding community. The HuggingFace leaderboard has over 5,000 model submissions. Every major embedding model from OpenAI, Google, Cohere, Alibaba, and Microsoft now reports MTEB scores. It is to embeddings what ImageNet was to vision.

§ 03 · Leaderboard

15 models, ranked by MMTEB-style aggregate score.

Updated 2026-05-17

#	Model	Type	Positioning	Avg	Retrieval	Class.	Cluster.	STS	Rerank	Dims	Params	License	Links
01	KaLM-Embedding-Gemma3-12B Tencent	Open weight	Ceiling model	72.32	75.7	77.9	55.8	79.0	67.3	3840	11.76B	tencent-kalm-embedding-community	paper code HF
02	Qwen3-Embedding-8B Qwen / Alibaba	Open source	Premium practical	70.58	70.9	74.0	57.6	81.1	65.6	4096	8B	Apache-2.0	paper HF
03	Seed1.6-embedding-1215 ByteDance	API	API candidate	70.26	66.0	76.8	56.8	75.9	66.2	1536	—	API terms
04	llama-embed-nemotron-8b NVIDIA	Open source	Open candidate	69.46	68.7	73.2	54.4	79.4	67.8	4096	8B	Check model card	HF
05	Qwen3-Embedding-4B Qwen / Alibaba	Open source	Premium practical	69.45	69.6	72.3	57.1	80.9	65.1	2560	4B	Apache-2.0	paper HF
06	gemini-embedding-001 Google	API	API candidate	68.37	67.7	71.8	54.6	79.4	65.6	3072	—	API terms
07	Octen-Embedding-8B Octen	Open source	Open candidate	67.85	71.7	66.7	55.7	81.3	67.6	4096	8B	Check model card	HF
08	Qwen3-Embedding-0.6B Qwen / Alibaba	Open source	Efficient baseline	64.34	64.7	66.8	52.3	76.2	61.4	1024	0.6B	Apache-2.0	paper HF
09	multilingual-e5-large-instruct Microsoft	Open source	Production baseline	63.22	57.1	64.9	50.8	76.8	62.6	1024	560M	MIT	paper HF
10	gte-Qwen2-7B-instruct Alibaba	Open source	Legacy strong	62.51	60.1	61.5	52.8	74.0	65.5	3584	7B	Apache-2.0	paper HF
11	text-multilingual-embedding-002 Google	API	API candidate	62.16	59.7	64.6	47.8	76.1	61.2	768	—	API terms
12	bge-m3 BAAI	Open source	Production baseline	59.56	57.9	62.3	48.2	74.5	56.8	1024	568M	MIT	paper code HF
13	text-embedding-3-large OpenAI	API	API baseline	58.96	56.1	62.5	45.2	72.5	54.1	3072	—	API terms
14	voyage-3.5 Voyage AI	API	API candidate	58.46	55.9	61.8	44.6	71.9	53.5	1024	—	API terms
15	jina-embeddings-v3 Jina AI	Open weight	Production baseline	58.37	54.5	61.2	43.8	71.3	52.9	1024	570M	CC BY-NC 4.0	paper HF

Fig 03 · Aggregated public leaderboard numbers are useful for shortlisting, not final RAG procurement. KaLM’s 72.32 comes from its Hugging Face model-card MMTEB table; Qwen3 is still stronger on some subcategories and has a cleaner Apache-2.0 deployment story. Full MTEB leaderboard with 5,000+ submissions lives at huggingface.co/spaces/mteb/leaderboard.

§ 04 · Datasets

The golden datasets.

MTEB’s power comes from its datasets. These aren’t synthetic toy problems: they’re real-world corpora with human annotations, covering domains from medical retrieval to banking intent classification. Four of them show what good embeddings have to get right.

Semantic Textual Similarity

STS Benchmark

8,628 sentence pairs

Cer et al., 2017 (SemEval)

The cornerstone STS dataset. Human annotators rated sentence pairs on a 0-5 scale of semantic equivalence. It has been a standard test of embedding quality since 2017.

Real examples

A: "A plane is taking off."

B: "An air plane is taking off."

5.00 · Perfect equivalence

A: "A woman is playing the guitar."

B: "A man is playing the flute."

1.60 · Different actions, different agents

A: "A man is smoking."

B: "A man is skating."

0.50 · Nearly unrelated

Retrieval

NFCorpus

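323 test queries, 3,633 documents (BEIR/MTEB version); the original release has 3,244 queries, 9,964 documents, and 169,756 relevance judgments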

Boteva et al., 2016 (NutritionFacts)

Medical information retrieval: given a health query, find relevant scientific documents. Tests whether embeddings can bridge the gap between lay language and medical terminology.

Real examples

A: "Does caffeine affect blood pressure?"

B: "Acute effects of coffee consumption on self-reported gastrointestinal symptoms, blood pressure and stress indices..."

Relevant · Lay query matched to scientific abstract

A: "vitamin D deficiency symptoms"

B: "The role of vitamin D in reducing cancer risk and progression..."

Relevant · Symptom query matched to clinical review

Retrieval

ArguAna

1,406 queries, 8,674 arguments

Wachsmuth et al., 2018

Counterargument retrieval: given an argument on a controversial topic, find the best counterargument. One of the hardest retrieval tasks because surface-level similarity is misleading.

Real examples

A: "Nuclear energy is clean and efficient, producing minimal greenhouse gases..."

B: "Nuclear waste remains radioactive for thousands of years with no safe long-term storage solution..."

Counter · Topically similar but argumentatively opposed

Classification

Banking77

13,083 customer queries

Casanueva et al., 2020

Intent detection in banking: classify customer messages into 77 fine-grained intents. Tests whether embeddings can distinguish between highly similar intents like "card_arrival" vs "card_delivery_estimate".

Real examples

A: "Why was I charged twice for the same transaction?"

Intent: transaction_charged_twice · Fine-grained intent classification

A: "My card doesn't work at ATMs abroad"

Intent: card_not_working · Must distinguish from similar card intents

§ 05 · Tasks

Eight tasks, eight kinds of understanding.

MTEB evaluates embeddings across eight fundamentally different tasks. A great embedding model must excel at all of them — each tests a different facet of text understanding.

Retrieval

NDCG@10

15 datasets

Given a query, find the most relevant documents from a corpus.

Example

"What is the capital of France?"

"Paris is the capital and most populous city of France, with an estimated population of 2,165,423."

The model must rank documents about Paris as capital highest among thousands of candidates.

How it works — Encode query and all documents independently. Rank by cosine similarity. NDCG@10 measures if relevant docs appear in top 10.

MS MARCO · NQ · HotpotQA · FiQA · +2 more
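The ranking step described above can be sketched in a few lines. This is a minimal illustration, assuming the embeddings already exist as plain lists of floats; the three-dimensional vectors are invented, and a real system would use an approximate-nearest-neighbour index instead of a full scan.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: Sequence[float],
                   doc_vecs: Sequence[Sequence[float]],
                   top_k: int = 10) -> List[Tuple[int, float]]:
    """Return (doc_index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Invented toy vectors: the query and documents are encoded independently.
query = [0.9, 0.1, 0.0]
docs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.7, 0.7, 0.0]]
ranking = rank_documents(query, docs, top_k=2)
top_ids = [doc_id for doc_id, _ in ranking]  # document 1 first, then document 2
```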

Classification

Accuracy

12 datasets

Classify text into categories using embeddings as features.

Example

"This product broke after two days. Terrible quality."

Label: Negative

Embeddings are used as features for a logistic regression classifier. No fine-tuning of the embedding model.

How it works — Embed all texts, fit a simple classifier (kNN or logistic regression) on train embeddings, evaluate on test set.

AmazonCounterfactual · Banking77 · EmotionClassification · TweetSentiment · +1 more
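To make "embeddings as frozen features" concrete, here is a minimal sketch of the kNN variant mentioned above, written in plain Python. The two-dimensional vectors and labels are invented; the point is only that the embedding model is never updated, and the classifier sees nothing but vectors.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def knn_predict(train_vecs: List[Sequence[float]], train_labels: List[str],
                query_vec: Sequence[float], k: int = 3) -> str:
    """Majority vote over the k training embeddings most similar to the query."""
    neighbours = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(pair[0], query_vec),
        reverse=True,
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Invented embeddings for four labelled training texts.
train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["negative", "negative", "positive", "positive"]

prediction = knn_predict(train_vecs, train_labels, query_vec=[0.8, 0.2])
```

Swapping the kNN vote for logistic regression changes the classifier but not the principle: benchmark accuracy reflects how linearly or locally separable the classes are in embedding space.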

Clustering

V-measure

11 datasets

Group semantically similar texts into clusters without labels.

Example

Cluster: ["quantum computing advances", "new qubit architecture", "stock market rally", "GDP growth forecast"]

Expected: {Science: [0,1], Finance: [2,3]}

Embeddings of similar topics should be closer together than embeddings of different topics.

How it works — Embed all texts, run k-means or mini-batch k-means, compare predicted clusters to ground truth with V-measure.

ArXiv Clustering (S2S) · Reddit Clustering · StackExchange Clustering · TwentyNewsgroups
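V-measure is the harmonic mean of homogeneity (each cluster contains one class) and completeness (each class lands in one cluster). The sketch below computes it from two label lists using the standard entropy-based definition; the cluster assignments are invented and stand in for the output of k-means on real embeddings.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target: Sequence[Hashable], given: Sequence[Hashable]) -> float:
    """H(target | given), estimated from co-occurrence counts."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (_, g), c in joint.items())


def v_measure(truth: Sequence[Hashable], predicted: Sequence[Hashable]) -> float:
    h_truth, h_pred = entropy(truth), entropy(predicted)
    homogeneity = 1.0 if h_truth == 0 else 1.0 - conditional_entropy(truth, predicted) / h_truth
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(predicted, truth) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Ground truth from the example above: two science texts, two finance texts.
truth = ["science", "science", "finance", "finance"]
perfect = v_measure(truth, [0, 0, 1, 1])   # clusters match the topics
shuffled = v_measure(truth, [0, 1, 0, 1])  # clusters carry no topic information
```

Cluster ids are arbitrary, so `[1, 1, 0, 0]` scores the same as `[0, 0, 1, 1]`; only the grouping matters.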

Reranking

MAP

4 datasets

Given a query and candidate documents, reorder by relevance.

Example

"How to fix segmentation fault in C?"

Reorder: [doc_A (irrelevant), doc_B (relevant), doc_C (partial)] -> [doc_B, doc_C, doc_A]

Unlike retrieval, candidates are pre-selected. The model must reorder them by relevance.

How it works — Score each query-document pair by cosine similarity, reorder candidates. Evaluate with Mean Average Precision (MAP).

AskUbuntuDupQuestions · MindSmallReranking · SciDocsRR · StackOverflowDupQuestions
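Mean Average Precision can be computed directly from the reordered candidate lists. The sketch below assumes binary relevance labels and that each list is already sorted by the model's similarity score; the two example queries are invented.

```python
from typing import List, Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average of precision@rank taken at each rank where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query: List[Sequence[bool]]) -> float:
    return sum(average_precision(q) for q in per_query) / len(per_query)


# Invented relevance labels, listed in the order the model ranked the candidates.
query_1 = [True, False, True]   # AP = (1/1 + 2/3) / 2
query_2 = [False, True]         # AP = (1/2) / 1
map_score = mean_average_precision([query_1, query_2])
```

The same `average_precision` function applies to pair classification if the pairs are sorted by cosine similarity and the labels mark duplicates.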

Semantic Textual Similarity

Spearman correlation

10 datasets

Predict the degree of semantic equivalence between sentence pairs.

Example

"A man is playing a guitar." vs "A person plays a musical instrument."

Human score: 4.2 / 5.0 (highly similar)

Model cosine similarity should correlate with human judgments across thousands of sentence pairs.

How it works — Compute cosine similarity for each sentence pair. Measure Spearman rank correlation with human-annotated similarity scores.

STS Benchmark · STS12 · STS13 · STS14 · +4 more

Pair Classification

Avg Precision (AP)

3 datasets

Determine the relationship between two texts (duplicate, paraphrase, entailment).

Example

"How do I reset my password?" vs "I forgot my login credentials, how to recover?"

Label: Duplicate

Cosine similarity between embeddings must separate duplicate pairs from non-duplicate pairs.

How it works — Compute cosine similarity for each pair. Use similarity as a classifier score. Evaluate with average precision (AP).

TwitterURLCorpus · SprintDuplicateQuestions · Quora Duplicate Questions (QQP subset)

Summarization

Spearman correlation

1 dataset

Evaluate how well a summary captures the meaning of a source document.

Example

Source: [full news article about climate policy]

Summary: "New climate bill targets 50% emission reduction by 2030"

Embedding similarity between source and summary should correlate with human quality judgments.

How it works — Embed source documents and their summaries. Cosine similarity should correlate with human-rated summary quality scores.

SummEval

Bitext Mining

F1

2 datasets

Find translation pairs between two sets of sentences in different languages.

Example

EN: "The cat sat on the mat."

DE: "Die Katze saß auf der Matte."

Cross-lingual embeddings must place translations closer than non-translation pairs.

How it works — Embed sentences in both languages. Match each source sentence to its nearest neighbor in the target language. Evaluate with F1.

Tatoeba · BUCC
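The nearest-neighbour matching step can be sketched as follows. This is a simplified illustration, assuming one gold translation per source sentence and invented two-dimensional vectors; it is not the benchmark's exact scoring code.

```python
import math
from typing import List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mine_pairs(src_vecs: List[Sequence[float]],
               tgt_vecs: List[Sequence[float]]) -> Set[Tuple[int, int]]:
    """Match every source sentence to its most similar target sentence."""
    pairs = set()
    for i, src in enumerate(src_vecs):
        best_j = max(range(len(tgt_vecs)), key=lambda j: cosine(src, tgt_vecs[j]))
        pairs.add((i, best_j))
    return pairs


def f1_score(predicted: Set[Tuple[int, int]], gold: Set[Tuple[int, int]]) -> float:
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented embeddings: source sentence i is a translation of target sentence i.
english = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
german = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.8]]
gold_pairs = {(0, 0), (1, 1), (2, 2)}

mined = mine_pairs(english, german)
score = f1_score(mined, gold_pairs)
```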

§ 06 · Timeline

Benchmark progress, 2019 → 2025.

From Sentence-BERT’s first dedicated sentence embeddings to today’s 12B-parameter models scoring 72+. Three eras: encoder-only (BERT/RoBERTa fine-tuning), instruction-tuned (E5, BGE), and LLM-based (Qwen3, KaLM).

2019

Sentence-BERT · ~51

Reimers & Gurevych show that BERT with siamese fine-tuning creates meaningful sentence embeddings. The field is born.

Encoder-only

2021

SimCSE · ~54

Contrastive learning on unsupervised data (dropout as augmentation) pushes STS scores without labeled data.

Encoder-only

2022

E5-base · ~57

Microsoft shows that weakly-supervised contrastive pre-training on massive web data creates superior embeddings. MTEB paper published.

Encoder-only

2023

bge-large-en-v1.5 · ~60

BAAI's BGE family takes the lead with instruction-following and hard negative mining. Open-source catches up to OpenAI.

Instruction-tuned

2024 Q1

E5-Mistral-7B · ~62

Microsoft proves that LLM backbones (Mistral-7B) create better embeddings than encoder-only models. A paradigm shift.

LLM-based

2024 Q2

gte-Qwen2-7B · ~63

Alibaba shows that Qwen2 backbone with GTE training matches E5-Mistral. LLM-based embeddings become the norm.

LLM-based

2024

bge-m3 / Jina v3 · ~59

Multi-granularity (dense + sparse + colbert) and task-LoRA adapters emerge as efficiency-focused alternatives.

Instruction-tuned

2025 Q2

Qwen3-Embedding-8B · ~70

Qwen3 family dominates with multi-task training across embedding + reranking tasks. First models to consistently break 70.

LLM-based

2025 Q4

KaLM-Gemma3-12B-2511 · 72.32

KaLM reports the top Borda rank on its MMTEB table, making it a credible ceiling model for broad multilingual embedding experiments, especially retrieval.

LLM-based

§ 07 · Trade-off

Accuracy vs model size.

LLM-based embeddings dominate the top of the public aggregate tables, but efficiency varies wildly. Qwen3-Embedding-0.6B scores 64.34 with just 600M parameters, while KaLM-Gemma3-12B needs 11.76B for 72.32. In production RAG, score-per-parameter, embedding dimension, and vector index cost matter.

Efficiency leaders

Model	Params	Score	pts / B
Qwen3-Embedding-0.6B	0.6B	64.34	107.2 pts/B
multilingual-e5-large-instruct	560M	63.22	112.9 pts/B
bge-m3	568M	59.56	104.9 pts/B
jina-embeddings-v3	570M	58.37	102.4 pts/B

Absolute performance leaders

Model	Params	Score	Dims
KaLM-Gemma3-12B	11.76B	72.32	3840
Qwen3-Embedding-8B	8B	70.58	4096
Seed1.6-embedding	—	70.26	1536
llama-embed-nemotron-8b	8B	69.46	4096
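Embedding dimension translates directly into storage. As a rough sketch, raw float32 vectors cost 4 bytes per dimension, so the footprint is vectors × dims × 4 bytes. The one-million-chunk corpus below is a hypothetical figure, and the estimate ignores index overhead (graph links, metadata) and any quantization.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_vectors: int, dims: int,
                     bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Storage for the raw vectors only, before any index structures."""
    return num_vectors * dims * bytes_per_value


# Dimensions taken from the leaderboard table on this page.
dims_by_model = {
    "KaLM-Gemma3-12B": 3840,
    "Qwen3-Embedding-8B": 4096,
    "Seed1.6-embedding": 1536,
    "Qwen3-Embedding-0.6B": 1024,
}

corpus_size = 1_000_000  # hypothetical number of chunks

footprint_gb = {
    name: raw_vector_bytes(corpus_size, dims) / 1e9
    for name, dims in dims_by_model.items()
}
# KaLM at 3840 dims needs 3.75x the raw storage of Qwen3-0.6B at 1024 dims.
```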

§ 08a · Essay

The LLM embedding revolution.

Before 2024, most widely used embedding models were small encoder-only transformers built on BERT, RoBERTa, or XLM-R. They typically topped out around 560M parameters and scored ~60 on MTEB. Then decoder-only LLMs proved to be stronger embedding backbones.

E5-Mistral was the prominent early demonstration: take Mistral-7B, add contrastive fine-tuning, and you get embeddings that outperform the older encoder-only generation on broad leaderboards. At the ceiling, the current shortlist is dominated by LLM backbones such as Gemma3, Qwen3, and LLaMA. Smaller encoder models still matter when latency, memory, or edge deployment decides the system.

§ 08b · Essay

Open weights vs API: the leaderboard gap closed.

When it launched in early 2024, OpenAI’s text-embedding-3-large was considered best-in-class by many teams. On this aggregate table it ranks 13th with 58.96, behind several open-weight models. That is a strong signal, but not a procurement verdict: API models can still win on support, compliance, autoscaling, or integration.

KaLM-Gemma3-12B (open-weight) — 72.32 aggregate, top reported MMTEB Borda rank.
gemini-embedding-001 (API) — 68.37 on the KaLM model-card table.
text-embedding-3-large (API) — 58.96 on this page’s legacy aggregate row.
Qwen3-0.6B (open-weight, small) — 64.34, a strong baseline for cheap multilingual experiments.

§ 09 · Reproduce

Run MTEB yourself.

MTEB is fully open source. Install it, pick a model, and benchmark it against the entire suite in a single script. Results are automatically formatted for submission to the HuggingFace leaderboard.

Full MTEB evaluation · Python

# Install first (shell command, not Python):
#   pip install mteb sentence-transformers

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Run every registered task that covers English. This is a superset of the
# classic 56-dataset English benchmark, so expect a long run.
# The MTEB(...).run(...) interface follows the mteb 1.x API; check the docs
# for the version you install, as newer releases may differ.
evaluation = mteb.MTEB(tasks=mteb.get_tasks(languages=["eng"]))
results = evaluation.run(model, output_folder="results/qwen3-0.6b")

# Or restrict the run to a single task type
retrieval_tasks = mteb.get_tasks(
    languages=["eng"],
    task_types=["Retrieval"]
)
evaluation = mteb.MTEB(tasks=retrieval_tasks)
results = evaluation.run(model, output_folder="results/retrieval-only")

Quick start: use shortlisted models · Python

from sentence_transformers import SentenceTransformer

# Option 1: ceiling benchmark model (KaLM-Gemma3-12B, high memory)
model = SentenceTransformer("tencent/KaLM-Embedding-Gemma3-12B-2511", trust_remote_code=True)
# Normalise the vectors so the dot product below is a cosine similarity
embeddings = model.encode(
    ["What is machine learning?", "ML is a subset of AI."],
    normalize_embeddings=True,
)
print(f"Similarity: {embeddings[0] @ embeddings[1]:.4f}")

# Option 2: practical baseline (Qwen3-0.6B, cheaper to run)
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
embeddings = model.encode(["Hello world", "Bonjour le monde"], normalize_embeddings=True)

# Option 3: production serving with Hugging Face TEI
# docker run --gpus all -p 8080:80 \
#   ghcr.io/huggingface/text-embeddings-inference:latest \
#   --model-id Qwen/Qwen3-Embedding-0.6B

§ 10 · Papers

Key papers.

Essential reading for understanding MTEB and modern text embeddings.

MTEB: Massive Text Embedding Benchmark

Muennighoff, Tazi, Magne, Reimers·EACL 2023·1,200+ citations

Original benchmark paper

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Reimers, Gurevych·EMNLP 2019·8,000+ citations

Foundation of modern embeddings

Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5)

Wang, Yang, Huang, et al.·arXiv 2022·1,500+ citations

E5 embedding family

BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity

Chen, Xiao, Zhang, et al.·ACL 2024 Findings·600+ citations

Multi-granularity retrieval

Improving Text Embeddings with Large Language Models (E5-Mistral)

Wang, Yang, Huang, et al.·ACL 2024·500+ citations

LLM-based embeddings

jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Sturua, Mohr, et al.·arXiv 2024·100+ citations

Task-specific adapters

Towards General Text Embeddings with Multi-stage Contrastive Learning (GTE)

Li, Zhang, et al.·arXiv 2023·400+ citations

Alibaba GTE family

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Qwen Team·arXiv 2025·50+ citations

Strong practical family

§ 11 · Code

Implementations worth reading.

Open-source repositories for training, evaluating, and serving embedding models.

embeddings-benchmark/mteb · ★ 2.1k

Official MTEB benchmark framework. Run evaluations on any model with a single command.

UKPLab/sentence-transformers · ★ 15.8k

The de facto library for text embeddings in Python. Load, fine-tune, and deploy embedding models.

FlagOpen/FlagEmbedding · ★ 8.2k

BAAI's BGE embedding family. Includes bge-m3, bge-reranker, and training code.

QwenLM/Qwen · ★ 18k

Qwen model family including Qwen3-Embedding. Multi-task training for embedding + reranking.

huggingface/text-embeddings-inference · ★ 3.4k

Production-grade serving for embedding models. Rust-based, supports batching, quantization.

HKUNLP/instructor-embedding · ★ 1.3k

Instruction-tuned embeddings. Pioneered the "Represent the X for Y" prompting approach.

§ 12 · Context

MTEB vs other embedding benchmarks.

Benchmark	Tasks	Datasets	Focus	Year
MTEB	8	56+	Comprehensive embedding evaluation	2022
BEIR	1	18	Zero-shot retrieval only	2021
SentEval	4	17	Sentence representation probing	2018
USEB	4	8	Unsupervised sentence embedding eval	2021
KILT	1	11	Knowledge-intensive language tasks	2021
AIR-Bench	2	24	Automated IR benchmark (LLM-judged)	2024

§ 13 · Metrics

Understanding the numbers.

NDCG@10 · Retrieval

Normalised Discounted Cumulative Gain at rank 10. Measures how well the model ranks relevant documents in the top 10 results, with higher positions weighted more heavily.

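NDCG@10 = DCG@10 / IDCG@10
DCG@10 = Σ_{i=1..10} rel_i / log₂(i+1)
where rel_i = relevance of the result at rank i, and IDCG@10 = DCG@10 of the ideal ordering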

A score of 1.0 means all relevant documents appear at the top. Most models score 0.4–0.7, reflecting the difficulty of zero-shot retrieval.
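The formula above translates almost line for line into code. This sketch assumes the input list holds the graded relevance of every judged document for one query, in the order the model ranked them, so the ideal ordering can be obtained by sorting that same list.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k positions (ranks start at 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    ideal_order = sorted(ranked_relevances, reverse=True)
    idcg = dcg_at_k(ideal_order, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0


best_case = ndcg_at_k([1, 0, 0])   # the only relevant document is ranked first
one_slip = ndcg_at_k([0, 1, 0])    # the relevant document slips to rank 2
```

Moving the single relevant document from rank 1 to rank 2 divides its gain by log₂(3), which is why small ranking mistakes near the top cost more than mistakes further down.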

Spearman ρ · STS

Spearman rank correlation between model cosine similarities and human similarity judgments. Measures whether the model's relative ordering of sentence pairs matches human intuition.

ρ = 1 − (6 × Σd_i²) / (n(n²−1))
where d_i = rank difference for pair i and n = number of pairs (this shortcut form assumes no tied ranks)

Spearman correlation of 0.80+ indicates strong alignment with human judgment. Top models now exceed 0.81.
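A direct implementation of the formula above, as a minimal sketch. It assumes no tied values in either list, which is the condition under which the shortcut formula is exact. The human scores reuse the three STS Benchmark examples from this page; the cosine values are invented.

```python
from typing import List, Sequence


def to_ranks(values: Sequence[float]) -> List[int]:
    """Rank 1 for the smallest value; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(to_ranks(x), to_ranks(y)))
    return 1 - (6 * squared_diffs) / (n * (n * n - 1))


human_scores = [5.0, 1.6, 0.5]          # from the STS examples on this page
model_cosines = [0.95, 0.40, 0.10]      # invented: same ordering as the humans
swapped_cosines = [0.40, 0.95, 0.10]    # invented: top two pairs swapped

agree = spearman_rho(model_cosines, human_scores)
partial = spearman_rho(swapped_cosines, human_scores)
```

Only the ordering matters: the model's cosine values never need to match the 0-5 human scale, they just need to sort the pairs the same way.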

§ 14 · Access

Where to find it.

MTEB GitHub

Official benchmark code. Install with pip install mteb.

HuggingFace Leaderboard

Full leaderboard with 5,000+ submissions and filters.

Original Paper

Muennighoff et al., EACL 2023. Benchmark design and analysis.

MTEB Datasets

All 56+ datasets available on HuggingFace Datasets.

§ 15 · Applications

When to use embeddings.

Text embeddings convert language into dense vectors that capture semantic meaning. The primary use cases where MTEB-benchmarked models excel.

RAG retrieval

Retrieve the right context chunks before generation. Embedding recall is often the first bottleneck in answer accuracy.

Semantic document search

Find relevant documents by meaning, not just keyword overlap. Embeddings enable natural-language queries over large corpora.

Duplicate detection

Identify near-duplicate content, support tickets, or records using cosine similarity between embedding pairs.

Clustering & topic modeling

Group documents by semantic similarity. Embeddings provide dense features for k-means, HDBSCAN, or topic extraction.

§ 16 · Patterns

Three production architectures.

Three common approaches to generating embeddings in production, each with distinct trade-offs.

Sentence transformers

Models trained specifically for sentence and paragraph embedding. Run locally with full control.

Pros

Optimised for retrieval, fast inference
Many specialised variants available

Cons

Fixed context length
May need domain fine-tuning

LLM embeddings via API

Embedding endpoints from OpenAI, Cohere, Voyage, and others. Zero infrastructure to manage.

Pros

High quality, long context
No infrastructure to maintain

Cons

Cost per token
Data leaves your system

Sparse + dense hybrid

Combine BM25 with dense embeddings for better recall. Best of both worlds for production search.

Pros

Handles exact matches well
More robust for rare terms

Cons

More complex pipeline
Two indices to maintain
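One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF), sketched below. This is one option among several rather than a method this page prescribes; the document ids are invented, and `k = 60` is a conventional smoothing constant, not a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Invented result lists from the two indices.
bm25_results = ["d3", "d1", "d2"]    # sparse / keyword ranking
dense_results = ["d1", "d2", "d4"]   # embedding ranking

fused = reciprocal_rank_fusion([bm25_results, dense_results])
```

RRF uses only rank positions, so it avoids having to calibrate BM25 scores against cosine similarities. Documents found by both retrievers rise to the top, while a document found by only one still survives in the fused list.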

§ 17 · Quick start

Embeddings in minutes.

Two approaches — hosted API or local model. Pick one.

OpenAI API · pip install openai

from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model='text-embedding-3-large'):
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = [get_embedding(doc) for doc in documents]
print(f'Embedding dimension: {len(embeddings[0])}')

Local with sentence-transformers · pip install sentence-transformers numpy

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('BAAI/bge-large-en-v1.5')

documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = model.encode(documents, normalize_embeddings=True)

query = 'pets resting at home'
query_embedding = model.encode(query, normalize_embeddings=True)

similarities = np.dot(embeddings, query_embedding)
for doc, sim in zip(documents, similarities):
    print(f'{sim:.3f}: {doc}')

§ 18 · More

Track more benchmarks.

MTEB is one of many benchmarks we track. Explore the full catalogue of NLP, computer vision, and reasoning benchmarks with live leaderboards.
