Text → Vector

Text Embedding

Convert text into dense vector representations for semantic search, clustering, and retrieval.

How Text Embedding Works

Neural networks convert images and text into vectors (lists of numbers). Similar concepts have similar vectors.

1. An image becomes a vector

The embedding model maps each item to a 768-dimensional vector (8 dimensions shown here, with illustrative labels):

  • Photo of a cat: animal 0.82, outdoor 0.15, food 0.03, furry 0.91, wild 0.22, water 0.08, small 0.67, cute 0.44 → [0.82, 0.15, 0.03, 0.91, 0.22, 0.08, 0.67, 0.44]
  • Photo of a dog: animal 0.88, outdoor 0.45, food 0.05, furry 0.85, wild 0.35, water 0.12, small 0.52, cute 0.78 → [0.88, 0.45, 0.05, 0.85, 0.35, 0.12, 0.52, 0.78]
2. Similar things have similar vectors

Cosine similarity between the vectors for the cat photo, the dog photo, a sunset landscape, and a meal:

              cat    dog    landscape  meal
  cat         1.00   0.95   0.29       0.34
  dog         0.95   1.00   0.48       0.39
  landscape   0.29   0.48   1.00       0.38
  meal        0.34   0.39   0.38       1.00

Key insight: the cat and dog vectors are similar (0.95) because both are animals, while the sunset landscape is far less similar to either (0.29 and 0.48). A short numerical check follows the steps below.
3. Search by text (same vector space)

A text query gets embedded too, so it can be compared directly against the item vectors in the same space.
4. Visualizing in 2D (t-SNE projection)

Projected down to two dimensions with t-SNE, the animal, landscape, and food items form separate clusters: similar items cluster together in vector space.
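As that check, cosine similarity is just the dot product of two vectors divided by the product of their lengths. A minimal sketch using the illustrative eight-dimensional cat and dog vectors from step 1 (it reproduces the 0.95 value in the matrix above):

import numpy as np

# Illustrative 8-dimensional vectors from step 1 (real models use 768+ dimensions)
cat = np.array([0.82, 0.15, 0.03, 0.91, 0.22, 0.08, 0.67, 0.44])
dog = np.array([0.88, 0.45, 0.05, 0.85, 0.35, 0.12, 0.52, 0.78])

# Cosine similarity: dot product divided by the product of the vector norms
similarity = np.dot(cat, dog) / (np.linalg.norm(cat) * np.linalg.norm(dog))
print(f'cat vs dog: {similarity:.2f}')  # ~0.95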

Use Cases

  • Semantic document search
  • RAG retrieval
  • Duplicate detection (see the sketch after this list)
  • Clustering and topic modeling
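The duplicate-detection use case, for instance, can be a simple similarity threshold over embeddings. A minimal sketch, assuming a local sentence-transformers model (BAAI/bge-large-en-v1.5, as in the code examples below) and an arbitrary 0.9 cutoff:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-en-v1.5')

texts = [
    'How do I reset my password?',
    'What is the procedure to reset a password?',
    'Where can I download the invoice?'
]

# Normalized embeddings, so the dot product is the cosine similarity
emb = model.encode(texts, normalize_embeddings=True)
sims = emb @ emb.T

# Flag pairs above an arbitrary similarity threshold as likely duplicates
threshold = 0.9
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        if sims[i, j] >= threshold:
            print(f'Possible duplicates ({sims[i, j]:.2f}): {texts[i]!r} / {texts[j]!r}')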

Architectural Patterns

Sentence Transformers

Use models trained specifically for sentence/paragraph embedding.

Pros:
  • Optimized for retrieval
  • Fast inference
  • Many specialized variants
Cons:
  • Fixed context length
  • May need domain fine-tuning

LLM Embeddings via API

Use embedding endpoints from OpenAI, Cohere, Voyage, etc.

Pros:
  • High quality
  • Long context
  • No infrastructure
Cons:
  • Cost per token
  • Data leaves your system

Sparse + Dense Hybrid

Combine BM25 with dense embeddings for better recall (a sketch follows the pros and cons below).

Pros:
  • Handles exact matches
  • More robust
  • Better for rare terms
Cons:
  • More complex pipeline
  • Two indices to maintain
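A minimal sketch of the hybrid pattern, assuming the rank_bm25 package for the sparse side and a sentence-transformers model for the dense side; the min-max rescaling and the 0.5 fusion weight are arbitrary choices:

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

documents = [
    'Python is a programming language',
    'JavaScript runs in the browser',
    'Databases store information'
]
query = 'python programming'

# Sparse side: BM25 over whitespace-tokenized, lowercased documents
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
sparse_scores = np.array(bm25.get_scores(query.lower().split()))

# Dense side: cosine similarity of normalized embeddings
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
doc_emb = model.encode(documents, normalize_embeddings=True)
dense_scores = doc_emb @ model.encode(query, normalize_embeddings=True)

# Min-max rescale each score list, then blend with an arbitrary weight
def rescale(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5  # arbitrary balance between dense and sparse evidence
hybrid = alpha * rescale(dense_scores) + (1 - alpha) * rescale(sparse_scores)
for score, doc in sorted(zip(hybrid, documents), reverse=True):
    print(f'{score:.3f}: {doc}')

Reciprocal rank fusion over the two ranked lists is a common alternative to blending raw scores.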

Implementations

API Services

OpenAI text-embedding-3-large

3072 dimensions, strong general performance. Supports dimensionality reduction.

Cohere embed-v3

1024 dimensions, excellent multilingual performance. Search and classification variants.

Voyage AI voyage-3

Strong MTEB performance. Code- and law-specialized variants.

Open Source

BGE-M3 (MIT)

Multilingual and multi-granularity. Dense + sparse retrieval in one model (see the sketch below).

E5-Mistral-7B-Instruct (MIT)

LLM-based embeddings. Excellent for long documents.

GTE-Qwen2 (Apache 2.0)

Strong all-rounder with good multilingual support.
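For the combined dense + sparse output that BGE-M3 exposes, a sketch assuming the FlagEmbedding package's BGEM3FlagModel interface (check the model card for the current API):

from FlagEmbedding import BGEM3FlagModel

# Assumed interface: one encode call returns dense vectors and sparse lexical weights
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences = ['The cat sat on the mat', 'A dog played in the park']
output = model.encode(sentences, return_dense=True, return_sparse=True)

print(output['dense_vecs'].shape)      # dense embeddings, one row per sentence
print(output['lexical_weights'][0])    # sparse token-weight mapping for the first sentence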

Benchmarks

Code Examples

Text Embedding with OpenAI

Use OpenAI's embedding API for high-quality vectors

Install: pip install openai
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model='text-embedding-3-large'):
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Embed documents
documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = [get_embedding(doc) for doc in documents]
print(f'Embedding dimension: {len(embeddings[0])}')
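The text-embedding-3 models also accept a dimensions parameter, which is the dimensionality reduction mentioned in the implementation notes above; a short sketch requesting 256-dimensional vectors:

from openai import OpenAI

client = OpenAI()

# Ask the API for a truncated 256-dimensional embedding instead of the full 3072
short = client.embeddings.create(
    input='The cat sat on the mat',
    model='text-embedding-3-large',
    dimensions=256
)
print(len(short.data[0].embedding))  # 256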

Local Embedding with Sentence Transformers

Run embeddings locally with no API costs

Install: pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a strong open-source embedding model (see the MTEB leaderboard for current rankings)
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

# Embed documents
documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = model.encode(documents, normalize_embeddings=True)

# Compute cosine similarity
query = 'pets resting at home'
query_embedding = model.encode(query, normalize_embeddings=True)

similarities = np.dot(embeddings, query_embedding)
for doc, sim in zip(documents, similarities):
    print(f'{sim:.3f}: {doc}')

Build a Semantic Search Index

Create a searchable vector database with FAISS

Install: pip install sentence-transformers faiss-cpu numpy
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load model
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

# Your documents
documents = [
    'Python is a programming language',
    'JavaScript runs in the browser',
    'Machine learning uses neural networks',
    'Databases store information',
    'APIs enable communication between services'
]

# Create embeddings
embeddings = model.encode(documents, normalize_embeddings=True)
embeddings = np.array(embeddings).astype('float32')

# Build a FAISS inner-product index; on normalized embeddings, inner product equals cosine similarity
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Search
query = 'how to store data'
query_vec = model.encode([query], normalize_embeddings=True).astype('float32')

D, I = index.search(query_vec, k=3)
print('Top 3 results:')
for score, idx in zip(D[0], I[0]):
    print(f'  {score:.3f}: {documents[idx]}')

Quick Facts

  • Input: Text
  • Output: Vector
  • Implementations: 3 open source, 3 API
  • Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for text embedding.

Submit Results