Feature Extraction
Feature extraction — generating dense vector embeddings from text — is the unsung infrastructure layer powering semantic search, RAG pipelines, clustering, and recommendation systems. Sentence-BERT (2019) made it practical, but the field exploded in 2023-2024 with instruction-tuned embedding models such as E5-Mistral and GTE-Qwen2, which turned decoder-only LLMs into embedding engines, alongside efficient encoder-based models like Nomic Embed; together they pushed average MTEB scores past 70 across 50+ tasks. The key insight was that pre-training scale transfers to embedding quality: a 7B-parameter embedding model decisively outperforms a 110M one on zero-shot retrieval. Matryoshka representation learning (Kusupati et al., 2022) added the ability to truncate embeddings to shorter prefix dimensions without retraining, making deployment flexible across latency and storage budgets.
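Two of the mechanics above are simple enough to sketch directly: sentence embeddings are commonly produced by mean-pooling an encoder's token vectors (masking out padding), and Matryoshka-style truncation just keeps the first k coordinates and L2-renormalizes. The sketch below uses synthetic random token vectors as a stand-in for real encoder outputs; the function names and shapes are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over the sequence, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(float)       # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)        # (batch, dim)
    counts = mask.sum(axis=1).clip(min=1e-9)              # avoid divide-by-zero
    return summed / counts

def matryoshka_truncate(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates, then L2-renormalize the prefix."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True).clip(min=1e-9)
    return truncated / norms

# Synthetic stand-in for encoder outputs: 2 sentences, 8 tokens, 768 dims.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 8, 768))
mask = np.ones((2, 8), dtype=int)
mask[1, 5:] = 0                                           # second sequence is padded

pooled = mean_pool(tokens, mask)
full = matryoshka_truncate(pooled, 768)                   # full-size embedding
short = matryoshka_truncate(pooled, 64)                   # truncated 64-d prefix

# Cosine similarity between the two sentences at each dimensionality:
# with unit-norm vectors this is just the dot product.
cos_full = float(full[0] @ full[1])
cos_short = float(short[0] @ short[1])
```

With a Matryoshka-trained model, `cos_short` approximates `cos_full` well because the loss forces the leading coordinates to carry most of the signal; with the random vectors here the two values simply illustrate the mechanics, not that property.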
MTEB Leaderboard
The Massive Text Embedding Benchmark (MTEB) ranks embedding models across 8 task categories; its public leaderboard tracks the current top performers.