§ 00 · Benchmark

MTEB leaderboard: best embedding models for RAG.

Direct answer: MTEB is the standard public embedding leaderboard for retrieval, classification, clustering, reranking, semantic similarity, and multilingual search. MTEB and MMTEB give the best public shortlist of text embedding models, but RAG needs a narrower question: which model retrieves the right chunks for your corpus, language mix, latency budget, vector database, and license constraints? Use the leaderboard to build the shortlist, then choose by latency, dimensions, context length, and license.

56+ datasets · 8 task categories · 72.32 MMTEB ceiling score · 15 models tracked
§ 01 · RAG selection

The practical answer is tiered.

KaLM-Embedding-Gemma3-12B-2511 is a credible ceiling model on the MMTEB table. That does not automatically make it the default model for a production RAG pipeline. Treat the benchmark leader as the upper bound, then compare it against Qwen3 and smaller baselines on your own retrieval set.

| Tier | Model | Use when | Caveat |
| --- | --- | --- | --- |
| Ceiling / benchmark chase | tencent/KaLM-Embedding-Gemma3-12B-2511 | Use when quality experiments, multilingual retrieval, and leaderboard coverage matter more than serving cost. | 11.76B parameters, 3840 dims, custom community license, and higher memory/indexing cost. |
| Premium practical | Qwen/Qwen3-Embedding-8B or Qwen/Qwen3-Embedding-4B | Use when you want a very strong multilingual model with long context and a cleaner deployment story. | Still expensive for high-volume vectorization; compare the 4B and 0.6B variants before defaulting to 8B. |
| Production baseline | BAAI/bge-m3, intfloat/multilingual-e5-large-instruct, embeddinggemma-300m | Use for first production RAG systems, hybrid search, or teams that value latency and operational simplicity. | May lose benchmark points, but often wins after quantization, reranking, and corpus-specific tuning. |
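
The memory caveat in the table can be made concrete with simple arithmetic. This is a rough sketch, assuming uncompressed float32 vectors and a hypothetical corpus of 10 million chunks; the corpus size and the helper name `raw_index_gib` are illustrative and not from the source. Real vector databases add index overhead and often quantize, so treat the numbers as a lower bound on an unquantized index.

```python
def raw_index_gib(num_vectors, dims, bytes_per_value=4):
    """Raw storage for dense vectors in GiB (float32 by default, no index overhead)."""
    return num_vectors * dims * bytes_per_value / 2**30

NUM_CHUNKS = 10_000_000  # hypothetical corpus size, chosen for illustration

# Dimensions taken from the leaderboard rows on this page.
kalm_gib = raw_index_gib(NUM_CHUNKS, 3840)      # KaLM-Embedding-Gemma3-12B
baseline_gib = raw_index_gib(NUM_CHUNKS, 1024)  # bge-m3 / Qwen3-Embedding-0.6B

print(f"3840 dims: {kalm_gib:.1f} GiB")
print(f"1024 dims: {baseline_gib:.1f} GiB")
print(f"ratio: {kalm_gib / baseline_gib:.2f}x")
```

Under these assumptions the 3840-dimension model needs 3.75 times the raw storage of a 1024-dimension baseline, before any index overhead.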
Verdict

Phrase KaLM as a top-tier open multilingual embedding ceiling model, especially for MMTEB-style aggregate and retrieval-heavy comparisons. Avoid calling it “the SOTA embedding model” globally: Qwen3 still wins some subcategories and may be the stronger production default.

§ 02 · Origins

Before MTEB, comparison was chaos.

The Massive Text Embedding Benchmark was introduced in 2022 by Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers, a collaboration between Hugging Face and Cohere. Before MTEB, comparing embedding models was a mess: each paper picked its own tasks, its own datasets, its own evaluation protocol. MTEB brought order.

The benchmark evaluates embeddings across eight distinct task categories spanning retrieval, classification, clustering, semantic similarity, reranking, pair classification, summarisation, and bitext mining. That breadth is the point: a model that dominates retrieval can still fail at clustering. MTEB catches that.

Since its release, MTEB has become the standard evaluation suite for the embedding community. The HuggingFace leaderboard has over 5,000 model submissions. Every major embedding model from OpenAI, Google, Cohere, Alibaba, and Microsoft now reports MTEB scores. It is to embeddings what ImageNet was to vision.

§ 03 · Leaderboard

15 models, ranked by MMTEB-style aggregate score.

Updated 2026-05-17
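| # | Model (provider) | Type | Positioning | Avg | Retrieval | Class. | Cluster. | STS | Rerank | Dims | Params | License | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 01 | KaLM-Embedding-Gemma3-12B (Tencent) | Open weight | Ceiling model | 72.32 | 75.7 | 77.9 | 55.8 | 79.0 | 67.3 | 3840 | 11.76B | tencent-kalm-embedding-community | paper, code, HF |
| 02 | Qwen3-Embedding-8B (Qwen / Alibaba) | Open source | Premium practical | 70.58 | 70.9 | 74.0 | 57.6 | 81.1 | 65.6 | 4096 | 8B | Apache-2.0 | paper, HF |
| 03 | Seed1.6-embedding-1215 (ByteDance) | API | API candidate | 70.26 | 66.0 | 76.8 | 56.8 | 75.9 | 66.2 | 1536 | - | API terms | - |
| 04 | llama-embed-nemotron-8b (NVIDIA) | Open source | Open candidate | 69.46 | 68.7 | 73.2 | 54.4 | 79.4 | 67.8 | 4096 | 8B | Check model card | HF |
| 05 | Qwen3-Embedding-4B (Qwen / Alibaba) | Open source | Premium practical | 69.45 | 69.6 | 72.3 | 57.1 | 80.9 | 65.1 | 2560 | 4B | Apache-2.0 | paper, HF |
| 06 | gemini-embedding-001 (Google) | API | API candidate | 68.37 | 67.7 | 71.8 | 54.6 | 79.4 | 65.6 | 3072 | - | API terms | - |
| 07 | Octen-Embedding-8B (Octen) | Open source | Open candidate | 67.85 | 71.7 | 66.7 | 55.7 | 81.3 | 67.6 | 4096 | 8B | Check model card | HF |
| 08 | Qwen3-Embedding-0.6B (Qwen / Alibaba) | Open source | Efficient baseline | 64.34 | 64.7 | 66.8 | 52.3 | 76.2 | 61.4 | 1024 | 0.6B | Apache-2.0 | paper, HF |
| 09 | multilingual-e5-large-instruct (Microsoft) | Open source | Production baseline | 63.22 | 57.1 | 64.9 | 50.8 | 76.8 | 62.6 | 1024 | 560M | MIT | paper, HF |
| 10 | gte-Qwen2-7B-instruct (Alibaba) | Open source | Legacy strong | 62.51 | 60.1 | 61.5 | 52.8 | 74.0 | 65.5 | 3584 | 7B | Apache-2.0 | paper, HF |
| 11 | text-multilingual-embedding-002 (Google) | API | API candidate | 62.16 | 59.7 | 64.6 | 47.8 | 76.1 | 61.2 | 768 | - | API terms | - |
| 12 | bge-m3 (BAAI) | Open source | Production baseline | 59.56 | 57.9 | 62.3 | 48.2 | 74.5 | 56.8 | 1024 | 568M | MIT | paper, code, HF |
| 13 | text-embedding-3-large (OpenAI) | API | API baseline | 58.96 | 56.1 | 62.5 | 45.2 | 72.5 | 54.1 | 3072 | - | API terms | - |
| 14 | voyage-3.5 (Voyage AI) | API | API candidate | 58.46 | 55.9 | 61.8 | 44.6 | 71.9 | 53.5 | 1024 | - | API terms | - |
| 15 | jina-embeddings-v3 (Jina AI) | Open source | Production baseline | 58.37 | 54.5 | 61.2 | 43.8 | 71.3 | 52.9 | 1024 | 570M | CC BY-NC 4.0 | paper, HF |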
Fig 03 · Aggregated public leaderboard numbers are useful for shortlisting, not final RAG procurement. KaLM’s 72.32 comes from its Hugging Face model-card MMTEB table; Qwen3 is still stronger on some subcategories and has a cleaner Apache-2.0 deployment story. Full MTEB leaderboard with 5,000+ submissions lives at huggingface.co/spaces/mteb/leaderboard.
§ 04 · Datasets

The golden datasets.

MTEB’s power comes from its datasets. These aren’t synthetic toy problems — they’re real-world corpora with human annotations, covering domains from medical retrieval to banking intent classification. Four that define what it means to have good embeddings.

Semantic Textual Similarity

STS Benchmark

8,628 sentence pairs
Cer et al., 2017 (SemEval)

The cornerstone STS dataset. Human annotators rated sentence pairs on a 0-5 scale of semantic equivalence. Used as THE standard test for embedding quality since 2017.

Real examples
A: "A plane is taking off."
B: "An air plane is taking off."
Score 5.00 · Perfect equivalence
A: "A woman is playing the guitar."
B: "A man is playing the flute."
Score 1.60 · Different actions, different agents
A: "A man is smoking."
B: "A man is skating."
Score 0.50 · Nearly unrelated
Retrieval

NFCorpus

323 test queries, 3,633 documents (BEIR version)
Boteva et al., 2016 (NutritionFacts)

Medical information retrieval: given a health query, find relevant scientific documents. Tests whether embeddings can bridge the gap between lay language and medical terminology.

Real examples
A: "Does caffeine affect blood pressure?"
B: "Acute effects of coffee consumption on self-reported gastrointestinal symptoms, blood pressure and stress indices..."
Relevant · Lay query matched to scientific abstract
A: "vitamin D deficiency symptoms"
B: "The role of vitamin D in reducing cancer risk and progression..."
Relevant · Symptom query matched to clinical review
Retrieval

ArguAna

1,406 queries, 8,674 arguments
Wachsmuth et al., 2018

Counterargument retrieval: given an argument on a controversial topic, find the best counterargument. One of the hardest retrieval tasks because surface-level similarity is misleading.

Real examples
A: "Nuclear energy is clean and efficient, producing minimal greenhouse gases..."
B: "Nuclear waste remains radioactive for thousands of years with no safe long-term storage solution..."
Counter · Topically similar but argumentatively opposed
Classification

Banking77

13,083 customer queries
Casanueva et al., 2020

Intent detection in banking: classify customer messages into 77 fine-grained intents. Tests whether embeddings can distinguish between highly similar intents like "card_arrival" vs "card_delivery_estimate".

Real examples
A: "Why was I charged twice for the same transaction?"
Intent: transaction_charged_twice · Fine-grained intent classification
A: "My card doesn't work at ATMs abroad"
Intent: card_not_working · Must distinguish from similar card intents
§ 05 · Tasks

Eight tasks, eight kinds of understanding.

MTEB evaluates embeddings across eight fundamentally different tasks. A great embedding model must excel at all of them — each tests a different facet of text understanding.

Retrieval

NDCG@10
15 datasets

Given a query, find the most relevant documents from a corpus.

Example
"What is the capital of France?"
"Paris is the capital and most populous city of France, with an estimated population of 2,165,423."
The model must rank documents about Paris as capital highest among thousands of candidates.
How it works — Encode query and all documents independently. Rank by cosine similarity. NDCG@10 measures if relevant docs appear in top 10.
MS MARCO · NQ · HotpotQA · FiQA · +2 more
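
The "encode, then rank by cosine similarity" step can be sketched in a few lines. This is a minimal illustration, not MTEB's implementation: the three-dimensional vectors and document ids are hypothetical stand-ins for real model output, and `rank_documents` is a name invented here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs, top_k=10):
    """Return (doc_id, score) pairs sorted by descending cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy embeddings; real models emit hundreds to thousands of dimensions.
query = [0.9, 0.1, 0.0]
docs = {
    "paris_capital": [0.8, 0.2, 0.1],
    "paris_fashion": [0.3, 0.9, 0.1],
    "berlin_weather": [0.0, 0.2, 0.9],
}

ranking = rank_documents(query, docs)
for doc_id, score in ranking:
    print(f"{score:.3f}  {doc_id}")
```

Because query and documents are encoded independently, document vectors can be computed once and indexed; only the query is embedded at search time.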

Classification

Accuracy
12 datasets

Classify text into categories using embeddings as features.

Example
"This product broke after two days. Terrible quality."
Label: Negative
Embeddings are used as features for a logistic regression classifier. No fine-tuning of the embedding model.
How it works — Embed all texts, fit a simple classifier (kNN or logistic regression) on train embeddings, evaluate on test set.
AmazonCounterfactual · Banking77 · EmotionClassification · TweetSentiment · +1 more
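
The key point is that the embedding model stays frozen and only a lightweight classifier sees the labels. As a sketch of the kNN variant mentioned above, here is a 1-nearest-neighbour classifier in plain Python. The two-dimensional unit vectors and the helper `predict_1nn` are hypothetical; a real run would use model embeddings and typically logistic regression.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(a * b for a, b in zip(u, v))

def predict_1nn(train, query_vec):
    """Return the label of the training vector most similar to query_vec."""
    _, best_label = max(train, key=lambda item: dot(item[0], query_vec))
    return best_label

# Hypothetical unit-length 2-d embeddings with sentiment labels.
train = [
    ([1.0, 0.0], "negative"),
    ([0.8, 0.6], "negative"),
    ([0.0, 1.0], "positive"),
    ([0.6, 0.8], "positive"),
]
test = [
    ([0.96, 0.28], "negative"),
    ([0.28, 0.96], "positive"),
]

correct = sum(predict_1nn(train, vec) == label for vec, label in test)
accuracy = correct / len(test)
print(f"accuracy: {accuracy:.2f}")
```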

Clustering

V-measure
11 datasets

Group semantically similar texts into clusters without labels.

Example
Cluster: ["quantum computing advances", "new qubit architecture", "stock market rally", "GDP growth forecast"]
Expected: {Science: [0,1], Finance: [2,3]}
Embeddings of similar topics should be closer together than embeddings of different topics.
How it works — Embed all texts, run k-means or mini-batch k-means, compare predicted clusters to ground truth with V-measure.
ArXiv Clustering (S2S) · Reddit Clustering · StackExchange Clustering · TwentyNewsgroups
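
V-measure is the harmonic mean of homogeneity (each cluster contains one class) and completeness (each class lands in one cluster). The following is a self-contained sketch of that definition using the four-text example above; it skips the k-means step and assumes cluster assignments are already available. Function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(target, given):
    """H(target | given) for two parallel label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items())

def v_measure(truth, predicted):
    """Harmonic mean of homogeneity and completeness."""
    h_truth, h_pred = entropy(truth), entropy(predicted)
    homogeneity = 1.0 if h_truth == 0 else 1.0 - conditional_entropy(truth, predicted) / h_truth
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(predicted, truth) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

truth = ["science", "science", "finance", "finance"]
perfect = [1, 1, 0, 0]   # cluster ids are arbitrary; only the grouping matters
shuffled = [0, 1, 0, 1]  # every cluster mixes both topics

print(f"perfect clustering:  {v_measure(truth, perfect):.3f}")
print(f"shuffled clustering: {v_measure(truth, shuffled):.3f}")
```

The score ignores cluster naming, which is why it suits unsupervised evaluation: swapping the ids 0 and 1 in `perfect` leaves the result unchanged.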

Reranking

MAP
4 datasets

Given a query and candidate documents, reorder by relevance.

Example
"How to fix segmentation fault in C?"
Reorder: [doc_A (irrelevant), doc_B (relevant), doc_C (partial)] -> [doc_B, doc_C, doc_A]
Unlike retrieval, candidates are pre-selected. The model must reorder them by relevance.
How it works — Score each query-document pair by cosine similarity, reorder candidates. Evaluate with Mean Average Precision (MAP).
AskUbuntuDupQuestions · MindSmallReranking · SciDocsRR · StackOverflowDupQuestions
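
Mean Average Precision rewards placing relevant candidates early in the reordered list. A minimal sketch under the assumption of binary relevance labels (the "partial" document in the example above is treated as not relevant here, which is a simplification):

```python
def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 flags in the model's ranked order, best first."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0

def mean_average_precision(queries):
    """Average the per-query AP over all queries."""
    return sum(average_precision(q) for q in queries) / len(queries)

# Two hypothetical queries with three candidates each.
query_1 = [1, 0, 1]  # relevant docs at ranks 1 and 3
query_2 = [0, 1, 0]  # single relevant doc at rank 2

map_score = mean_average_precision([query_1, query_2])
print(f"AP(query_1) = {average_precision(query_1):.4f}")
print(f"AP(query_2) = {average_precision(query_2):.4f}")
print(f"MAP = {map_score:.4f}")
```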

Semantic Textual Similarity

Spearman correlation
10 datasets

Predict the degree of semantic equivalence between sentence pairs.

Example
"A man is playing a guitar." vs "A person plays a musical instrument."
Human score: 4.2 / 5.0 (highly similar)
Model cosine similarity should correlate with human judgments across thousands of sentence pairs.
How it works — Compute cosine similarity for each sentence pair. Measure Spearman rank correlation with human-annotated similarity scores.
STS Benchmark · STS12 · STS13 · STS14 · +4 more

Pair Classification

Avg Precision (AP)
3 datasets

Determine the relationship between two texts (duplicate, paraphrase, entailment).

Example
"How do I reset my password?" vs "I forgot my login credentials, how to recover?"
Label: Duplicate
Cosine similarity between embeddings must separate duplicate pairs from non-duplicate pairs.
How it works — Compute cosine similarity for each pair. Use similarity as a classifier score. Evaluate with average precision (AP).
TwitterURLCorpus · SprintDuplicateQuestions · TwitterSemEval2015

Summarization

Spearman correlation
1 dataset

Evaluate how well a summary captures the meaning of a source document.

Example
Source: [full news article about climate policy]
Summary: "New climate bill targets 50% emission reduction by 2030"
Embedding similarity between source and summary should correlate with human quality judgments.
How it works — Embed source documents and their summaries. Cosine similarity should correlate with human-rated summary quality scores.
SummEval

Bitext Mining

F1
2 datasets

Find translation pairs between two sets of sentences in different languages.

Example
EN: "The cat sat on the mat."
DE: "Die Katze saß auf der Matte."
Cross-lingual embeddings must place translations closer than non-translation pairs.
How it works — Embed sentences in both languages. Match each source sentence to its nearest neighbor in the target language. Evaluate with F1.
Tatoeba · BUCC
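
Nearest-neighbour matching followed by F1 can be sketched as follows. This is a simplified illustration with hypothetical two-dimensional vectors; the helper names are invented here, and the actual benchmark tasks apply their own matching and scoring details.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mine_pairs(src_vecs, tgt_vecs):
    """For each source sentence, return the index of the most similar target sentence."""
    return [
        max(range(len(tgt_vecs)), key=lambda j: dot(src, tgt_vecs[j]))
        for src in src_vecs
    ]

def f1_score(predicted, gold):
    """F1 over (source_index, target_index) pairs."""
    predicted_pairs = set(enumerate(predicted))
    gold_pairs = set(enumerate(gold))
    true_positives = len(predicted_pairs & gold_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)

# Hypothetical embeddings: source sentence i translates to target gold[i].
english = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
german = [[0.1, 0.99], [0.98, 0.2], [0.7, 0.7]]
gold = [1, 0, 2]

predicted = mine_pairs(english, german)
print(predicted, f1_score(predicted, gold))
```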
§ 06 · Timeline

Benchmark progress, 2019 → 2025.

From Sentence-BERT’s first dedicated sentence embeddings to today’s 12B-parameter models scoring 72+. Three eras: encoder-only (BERT/RoBERTa fine-tuning), instruction-tuned (E5, BGE), and LLM-based (Qwen3, KaLM).

2019 · Sentence-BERT · ~51 · Encoder-only
Reimers & Gurevych show that BERT with siamese fine-tuning creates meaningful sentence embeddings. The field is born.
2021 · SimCSE · ~54 · Encoder-only
Contrastive learning on unsupervised data (dropout as augmentation) pushes STS scores without labeled data.
2022 · E5-base · ~57 · Encoder-only
Microsoft shows that weakly-supervised contrastive pre-training on massive web data creates superior embeddings. MTEB paper published.
2023 · bge-large-en-v1.5 · ~60 · Instruction-tuned
BAAI's BGE family takes the lead with instruction-following and hard negative mining. Open-source catches up to OpenAI.
2024 Q1 · E5-Mistral-7B · ~62 · LLM-based
Microsoft proves that LLM backbones (Mistral-7B) create better embeddings than encoder-only models. A paradigm shift.
2024 Q2 · gte-Qwen2-7B · ~63 · LLM-based
Alibaba shows that Qwen2 backbone with GTE training matches E5-Mistral. LLM-based embeddings become the norm.
2024 Q4 · bge-m3 / Jina v3 · ~59 · Instruction-tuned
Multi-granularity (dense + sparse + ColBERT) and task-LoRA adapters emerge as efficiency-focused alternatives.
2025 Q2 · Qwen3-Embedding-8B · ~70 · LLM-based
Qwen3 family dominates with multi-task training across embedding + reranking tasks. First models to consistently break 70.
2025 Q4 · KaLM-Gemma3-12B-2511 · 72.32 · LLM-based
KaLM reports the top Borda rank on its MMTEB table, making it a credible ceiling model for broad multilingual embedding experiments, especially retrieval.
§ 07 · Trade-off

Accuracy vs model size.

LLM-based embeddings dominate the top of the public aggregate tables, but efficiency varies wildly. Qwen3-Embedding-0.6B scores 64.34 with just 600M parameters, while KaLM-Gemma3-12B needs 11.76B for 72.32. In production RAG, score-per-parameter, embedding dimension, and vector index cost matter.

Efficiency leaders
| Model | Params | Score | pts / B |
| --- | --- | --- | --- |
| Qwen3-Embedding-0.6B | 0.6B | 64.34 | 107.2 |
| multilingual-e5-large-instruct | 560M | 63.22 | 112.9 |
| bge-m3 | 568M | 59.56 | 104.9 |
| jina-embeddings-v3 | 570M | 58.37 | 102.4 |
Absolute performance leaders
| Model | Params | Score | Dims |
| --- | --- | --- | --- |
| KaLM-Gemma3-12B | 11.76B | 72.32 | 3840 |
| Qwen3-Embedding-8B | 8B | 70.58 | 4096 |
| Seed1.6-embedding | - | 70.26 | 1536 |
| llama-embed-nemotron-8b | 8B | 69.46 | 4096 |
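
The pts / B column is the aggregate score divided by the parameter count in billions. A short sketch that recomputes it from the figures on this page, and adds the same ratio for KaLM for contrast (that last value is derived here, not quoted from the source):

```python
# (parameters in billions, aggregate score) as listed on this page
models = {
    "Qwen3-Embedding-0.6B": (0.6, 64.34),
    "multilingual-e5-large-instruct": (0.56, 63.22),
    "bge-m3": (0.568, 59.56),
    "jina-embeddings-v3": (0.57, 58.37),
    "KaLM-Embedding-Gemma3-12B": (11.76, 72.32),
}

points_per_billion = {
    name: round(score / params, 1) for name, (params, score) in models.items()
}

# Print from most to least parameter-efficient.
for name, value in sorted(points_per_billion.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{value:6.1f} pts/B  {name}")
```

The ratio flatters tiny models by construction, so it is better read as a tie-breaker within a size class than as a ranking across classes.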
§ 08a · Essay

The LLM embedding revolution.

Before 2024, embedding models were small encoder-only transformers — BERT, RoBERTa, XLM-R. They maxed out around 560M parameters and scored ~60 on MTEB. Then researchers discovered that decoder-only LLMs make better embedding backbones.

E5-Mistral proved it first: take Mistral-7B, add contrastive fine-tuning, and you get embeddings that outperform the older encoder-only generation on broad leaderboards. At the ceiling, the current shortlist is dominated by LLM backbones such as Gemma3, Qwen3, and LLaMA. Smaller encoder models still matter when latency, memory, or edge deployment decides the system.

§ 08b · Essay

Open weights vs API: the leaderboard gap closed.

When it launched in early 2024, OpenAI’s text-embedding-3-large was considered best-in-class by many teams. On this aggregate table it ranks 13th with 58.96, behind several open-weight models. That is a strong signal, but not a procurement verdict: API models can still win on support, compliance, autoscaling, or integration.

  • KaLM-Gemma3-12B (open-weight) — 72.32 aggregate, top reported MMTEB Borda rank.
  • gemini-embedding-001 (API) — 68.37 on the KaLM model-card table.
  • text-embedding-3-large (API) — 58.96 on this page’s legacy aggregate row.
  • Qwen3-0.6B (open-weight, small) — 64.34, a strong baseline for cheap multilingual experiments.
§ 09 · Reproduce

Run MTEB yourself.

MTEB is fully open source. Install it, pick a model, and benchmark it against the entire suite in a single script. Results are automatically formatted for submission to the HuggingFace leaderboard.

Full MTEB evaluation (Python)
# Install first (shell command, not Python):
#   pip install mteb sentence-transformers

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Select every task tagged as English in the mteb registry.
# Note: this is broader than the original 56-dataset English benchmark.
evaluation = mteb.MTEB(tasks=mteb.get_tasks(languages=["eng"]))
results = evaluation.run(model, output_folder="results/qwen3-0.6b")

# Or run specific task types
retrieval_tasks = mteb.get_tasks(
    languages=["eng"],
    task_types=["Retrieval"]
)
evaluation = mteb.MTEB(tasks=retrieval_tasks)
results = evaluation.run(model, output_folder="results/retrieval-only")
Quick start: use shortlisted models (Python)
# Option 1: ceiling benchmark model (KaLM-Gemma3-12B, high memory)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tencent/KaLM-Embedding-Gemma3-12B-2511", trust_remote_code=True)
# Normalize so the dot product below equals cosine similarity.
embeddings = model.encode(
    ["What is machine learning?", "ML is a subset of AI."],
    normalize_embeddings=True,
)
print(f"Similarity: {embeddings[0] @ embeddings[1]:.4f}")

# Option 2: practical baseline (Qwen3-Embedding-0.6B, cheaper to run)
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
embeddings = model.encode(["Hello world", "Bonjour le monde"], normalize_embeddings=True)

# Option 3: Production serving with HuggingFace TEI
# docker run --gpus all -p 8080:80 \
#   ghcr.io/huggingface/text-embeddings-inference:latest \
#   --model-id Qwen/Qwen3-Embedding-0.6B
§ 10 · Papers

Key papers.

Essential reading for understanding MTEB and modern text embeddings.

MTEB: Massive Text Embedding Benchmark
Muennighoff, Tazi, Magne, Reimers·EACL 2023·1,200+ citations
Original benchmark paper
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Reimers, Gurevych·EMNLP 2019·8,000+ citations
Foundation of modern embeddings
Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5)
Wang, Yang, Wei, et al.·arXiv 2022·1,500+ citations
E5 embedding family
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity
Chen, Xiao, Zhang, et al.·ACL 2024 Findings·600+ citations
Multi-granularity retrieval
Improving Text Embeddings with Large Language Models (E5-Mistral)
Wang, Yang, Wei, et al.·ACL 2024·500+ citations
LLM-based embeddings
jina-embeddings-v3: Multilingual Embeddings With Task LoRA
Sturua, Mohr, et al.·arXiv 2024·100+ citations
Task-specific adapters
Towards General Text Embeddings with Multi-stage Contrastive Learning (GTE)
Li, Zhang, et al.·arXiv 2023·400+ citations
Alibaba GTE family, a strong practical family
§ 11 · Code

Implementations worth reading.

Open-source repositories for training, evaluating, and serving embedding models.

embeddings-benchmark/mteb · 2.1k stars

Official MTEB benchmark framework. Run evaluations on any model with a single command.

UKPLab/sentence-transformers · 15.8k stars

The de facto library for text embeddings in Python. Load, fine-tune, and deploy embedding models.

FlagOpen/FlagEmbedding · 8.2k stars

BAAI's BGE embedding family. Includes bge-m3, bge-reranker, and training code.

QwenLM/Qwen · 18k stars

Qwen model family including Qwen3-Embedding. Multi-task training for embedding + reranking.

huggingface/text-embeddings-inference · 3.4k stars

Production-grade serving for embedding models. Rust-based, supports batching, quantization.

HKUNLP/instructor-embedding · 1.3k stars

Instruction-tuned embeddings. Pioneered the "Represent the X for Y" prompting approach.

§ 12 · Context

MTEB vs other embedding benchmarks.

| Benchmark | Tasks | Datasets | Focus | Year |
| --- | --- | --- | --- | --- |
| MTEB | 8 | 56+ | Comprehensive embedding evaluation | 2022 |
| BEIR | 1 | 18 | Zero-shot retrieval only | 2021 |
| SentEval | 4 | 17 | Sentence representation probing | 2018 |
| USEB | 4 | 8 | Unified sentence embedding eval | 2022 |
| KILT | 1 | 11 | Knowledge-intensive language tasks | 2021 |
| AIR-Bench | 2 | 24 | Automated IR benchmark (LLM-judged) | 2024 |
§ 13 · Metrics

Understanding the numbers.

NDCG@10 · Retrieval

Normalised Discounted Cumulative Gain at rank 10. Measures how well the model ranks relevant documents in the top 10 results, with higher positions weighted more heavily.

NDCG@10 = DCG@10 / IDCG@10
DCG@10 = Σ(rel_i / log₂(i+1))

A score of 1.0 means all relevant documents appear at the top. Most models score 0.4–0.7, reflecting the difficulty of zero-shot retrieval.
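
The two formulas above translate directly into code. A minimal sketch, assuming the input list holds graded relevance labels for every judged document in the order the model ranked them, so the ideal ordering can be obtained by sorting that same list:

```python
import math

def dcg_at_k(relevances, k=10):
    """DCG@k = sum of rel_i / log2(i + 1) over the top-k positions (i starts at 1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the model's ranking divided by DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0

# Hypothetical ranking: relevant documents at positions 1 and 3.
ranked = [1, 0, 1, 0, 0]
score = ndcg_at_k(ranked)
print(f"DCG@10  = {dcg_at_k(ranked):.4f}")
print(f"NDCG@10 = {score:.4f}")
```

Moving the second relevant document from position 3 to position 2 would raise the score to 1.0, which shows how the logarithmic discount penalizes late hits.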

Spearman ρ · STS

Spearman rank correlation between model cosine similarities and human similarity judgments. Measures whether the model's relative ordering of sentence pairs matches human intuition.

ρ = 1 − (6 × Σd_i²) / (n(n²−1))
where d_i = rank difference for pair i

Spearman correlation of 0.80+ indicates strong alignment with human judgment. Top models now exceed 0.81.
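
The closed-form expression above holds when there are no tied values. A small sketch that applies it to four sentence pairs: the human scores reuse values quoted earlier on this page, while the model cosine similarities are hypothetical.

```python
def ranks(values):
    """Rank positions (1 = smallest). Assumes no ties, as the closed-form formula requires."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), with d_i the rank difference for pair i."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - (6 * d_squared) / (n * (n**2 - 1))

human_scores = [5.0, 1.6, 0.5, 4.2]       # annotator similarity ratings (0-5 scale)
model_cosines = [0.95, 0.40, 0.45, 0.80]  # hypothetical model similarities

rho = spearman_rho(human_scores, model_cosines)
print(f"Spearman rho = {rho:.2f}")
```

Only the ordering matters: the model swaps the two least similar pairs, which costs 0.2 of correlation even though its absolute similarity values are on a different scale from the human ratings.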

§ 14 · Access

Where to find it.

MTEB GitHub

Official benchmark code. Install with pip install mteb.

HuggingFace Leaderboard

Full leaderboard with 5,000+ submissions and filters.

Original Paper

Muennighoff et al., EACL 2023. Benchmark design and analysis.

MTEB Datasets

All 56+ datasets available on HuggingFace Datasets.

§ 15 · Applications

When to use embeddings.

Text embeddings convert language into dense vectors that capture semantic meaning. These are the primary use cases where MTEB-benchmarked models excel.

RAG retrieval

Retrieve the right context chunks before generation. Embedding recall is often the first bottleneck in answer accuracy.

Semantic document search

Find relevant documents by meaning, not just keyword overlap. Embeddings enable natural-language queries over large corpora.

Duplicate detection

Identify near-duplicate content, support tickets, or records using cosine similarity between embedding pairs.

Clustering & topic modeling

Group documents by semantic similarity. Embeddings provide dense features for k-means, HDBSCAN, or topic extraction.

§ 16 · Patterns

Three production architectures.

Three common approaches to generating embeddings in production, each with distinct trade-offs.

Sentence transformers

Models trained specifically for sentence and paragraph embedding. Run locally with full control.

Pros
  • Optimised for retrieval, fast inference
  • Many specialised variants available
Cons
  • Fixed context length
  • May need domain fine-tuning

LLM embeddings via API

Embedding endpoints from OpenAI, Cohere, Voyage, and others. Zero infrastructure to manage.

Pros
  • High quality, long context
  • No infrastructure to maintain
Cons
  • Cost per token
  • Data leaves your system

Sparse + dense hybrid

Combine BM25 with dense embeddings for better recall. Best of both worlds for production search.

Pros
  • Handles exact matches well
  • More robust for rare terms
Cons
  • More complex pipeline
  • Two indices to maintain
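
The source does not say how the BM25 and dense result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than a prescribed method. The document ids are hypothetical, and `k=60` is a conventional default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids; each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 results from each index for the same query.
bm25_results = ["doc_exact", "doc_b", "doc_a"]   # keyword match ranks first
dense_results = ["doc_a", "doc_exact", "doc_c"]  # semantic match ranks first

fused = reciprocal_rank_fusion([bm25_results, dense_results])
print(fused)
```

RRF uses ranks rather than raw scores, so BM25 scores and cosine similarities never need to be put on a common scale; documents found by both indices rise to the top.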
§ 17 · Quick start

Embeddings in minutes.

Two approaches — hosted API or local model. Pick one.

OpenAI API · pip install openai
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model='text-embedding-3-large'):
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = [get_embedding(doc) for doc in documents]
print(f'Embedding dimension: {len(embeddings[0])}')
Local with sentence-transformers · pip install sentence-transformers numpy
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('BAAI/bge-large-en-v1.5')

documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = model.encode(documents, normalize_embeddings=True)

query = 'pets resting at home'
query_embedding = model.encode(query, normalize_embeddings=True)

similarities = np.dot(embeddings, query_embedding)
for doc, sim in zip(documents, similarities):
    print(f'{sim:.3f}: {doc}')