Text → Vector

Text Embedding

Convert text into dense vector representations for semantic search, clustering, and retrieval.

How Text Embedding Works

Neural networks convert images and text into vectors (lists of numbers). Similar concepts have similar vectors.

1. An image becomes a vector

The embedding model maps each item to a 768-dimensional vector (8 dimensions shown here, with illustrative labels):

  • Photo of a cat: animal 0.82, outdoor 0.15, food 0.03, furry 0.91, wild 0.22, water 0.08, small 0.67, cute 0.44 → [0.82, 0.15, 0.03, 0.91, 0.22, 0.08, 0.67, 0.44]
  • Photo of a dog: animal 0.88, outdoor 0.45, food 0.05, furry 0.85, wild 0.35, water 0.12, small 0.52, cute 0.78 → [0.88, 0.45, 0.05, 0.85, 0.35, 0.12, 0.52, 0.78]
2. Similar things have similar vectors

Cosine similarity between the vectors for the cat photo, the dog photo, a sunset landscape, and a meal:

              cat    dog    landscape  meal
  cat         1.00   0.95   0.29       0.34
  dog         0.95   1.00   0.48       0.39
  landscape   0.29   0.48   1.00       0.38
  meal        0.34   0.39   0.38       1.00

Key insight: the cat and dog vectors are similar (0.95) because both are animals, while the sunset landscape is far less similar to either (0.29 and 0.48). A short numerical check follows the steps below.
3. Search by text (same vector space)

A text query gets embedded too, so it can be compared directly against the item vectors in the same space.
4. Visualizing in 2D (t-SNE projection)

Projected down to two dimensions with t-SNE, the animal, landscape, and food items form separate clusters: similar items cluster together in vector space.
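As that check, cosine similarity is just the dot product of two vectors divided by the product of their lengths. A minimal sketch using the illustrative eight-dimensional cat and dog vectors from step 1 (it reproduces the 0.95 value in the matrix above):

import numpy as np

# Illustrative 8-dimensional vectors from step 1 (real models use 768+ dimensions)
cat = np.array([0.82, 0.15, 0.03, 0.91, 0.22, 0.08, 0.67, 0.44])
dog = np.array([0.88, 0.45, 0.05, 0.85, 0.35, 0.12, 0.52, 0.78])

# Cosine similarity: dot product divided by the product of the vector norms
similarity = np.dot(cat, dog) / (np.linalg.norm(cat) * np.linalg.norm(dog))
print(f'cat vs dog: {similarity:.2f}')  # ~0.95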

Use Cases

  • Semantic document search
  • RAG retrieval
  • Duplicate detection (see the sketch after this list)
  • Clustering and topic modeling
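The duplicate-detection use case, for instance, can be a simple similarity threshold over embeddings. A minimal sketch, assuming a local sentence-transformers model (BAAI/bge-large-en-v1.5, as in the code examples below) and an arbitrary 0.9 cutoff:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-en-v1.5')

texts = [
    'How do I reset my password?',
    'What is the procedure to reset a password?',
    'Where can I download the invoice?'
]

# Normalized embeddings, so the dot product is the cosine similarity
emb = model.encode(texts, normalize_embeddings=True)
sims = emb @ emb.T

# Flag pairs above an arbitrary similarity threshold as likely duplicates
threshold = 0.9
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        if sims[i, j] >= threshold:
            print(f'Possible duplicates ({sims[i, j]:.2f}): {texts[i]!r} / {texts[j]!r}')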

Architectural Patterns

Sentence Transformers

Use models trained specifically for sentence/paragraph embedding.

Pros:
  • Optimized for retrieval
  • Fast inference
  • Many specialized variants
Cons:
  • Fixed context length
  • May need domain fine-tuning

LLM Embeddings via API

Use embedding endpoints from OpenAI, Cohere, Voyage, etc.

Pros:
  • High quality
  • Long context
  • No infrastructure
Cons:
  • Cost per token
  • Data leaves your system

Sparse + Dense Hybrid

Combine BM25 with dense embeddings for better recall (a sketch follows the pros and cons below).

Pros:
  • Handles exact matches
  • More robust
  • Better for rare terms
Cons:
  • More complex pipeline
  • Two indices to maintain
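A minimal sketch of the hybrid pattern, assuming the rank_bm25 package for the sparse side and a sentence-transformers model for the dense side; the min-max rescaling and the 0.5 fusion weight are arbitrary choices:

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

documents = [
    'Python is a programming language',
    'JavaScript runs in the browser',
    'Databases store information'
]
query = 'python programming'

# Sparse side: BM25 over whitespace-tokenized, lowercased documents
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
sparse_scores = np.array(bm25.get_scores(query.lower().split()))

# Dense side: cosine similarity of normalized embeddings
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
doc_emb = model.encode(documents, normalize_embeddings=True)
dense_scores = doc_emb @ model.encode(query, normalize_embeddings=True)

# Min-max rescale each score list, then blend with an arbitrary weight
def rescale(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5  # arbitrary balance between dense and sparse evidence
hybrid = alpha * rescale(dense_scores) + (1 - alpha) * rescale(sparse_scores)
for score, doc in sorted(zip(hybrid, documents), reverse=True):
    print(f'{score:.3f}: {doc}')

Reciprocal rank fusion over the two ranked lists is a common alternative to blending raw scores.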

Implementations

API Services

OpenAI text-embedding-3-large

3072 dimensions, strong general performance. Supports dimensionality reduction.

Cohere embed-v3

1024 dimensions, excellent multilingual performance. Search and classification variants.

Voyage AI voyage-3

Strong MTEB performance. Code- and law-specialized variants.

Open Source

BGE-M3 (MIT)

Multilingual and multi-granularity. Dense + sparse retrieval in one model (see the sketch below).

E5-Mistral-7B-Instruct (MIT)

LLM-based embeddings. Excellent for long documents.

GTE-Qwen2 (Apache 2.0)

Strong all-rounder with good multilingual support.
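For the combined dense + sparse output that BGE-M3 exposes, a sketch assuming the FlagEmbedding package's BGEM3FlagModel interface (check the model card for the current API):

from FlagEmbedding import BGEM3FlagModel

# Assumed interface: one encode call returns dense vectors and sparse lexical weights
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences = ['The cat sat on the mat', 'A dog played in the park']
output = model.encode(sentences, return_dense=True, return_sparse=True)

print(output['dense_vecs'].shape)      # dense embeddings, one row per sentence
print(output['lexical_weights'][0])    # sparse token-weight mapping for the first sentence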

Benchmarks

Code Examples

Text Embedding with OpenAI

Use OpenAI's embedding API for high-quality vectors

Install: pip install openai
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model='text-embedding-3-large'):
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Embed documents
documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = [get_embedding(doc) for doc in documents]
print(f'Embedding dimension: {len(embeddings[0])}')
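The text-embedding-3 models also accept a dimensions parameter, which is the dimensionality reduction mentioned in the implementation notes above; a short sketch requesting 256-dimensional vectors:

from openai import OpenAI

client = OpenAI()

# Ask the API for a truncated 256-dimensional embedding instead of the full 3072
short = client.embeddings.create(
    input='The cat sat on the mat',
    model='text-embedding-3-large',
    dimensions=256
)
print(len(short.data[0].embedding))  # 256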

Local Embedding with Sentence Transformers

Run embeddings locally with no API costs

Install: pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a strong open-source embedding model (see the MTEB leaderboard for current rankings)
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

# Embed documents
documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = model.encode(documents, normalize_embeddings=True)

# Compute cosine similarity
query = 'pets resting at home'
query_embedding = model.encode(query, normalize_embeddings=True)

similarities = np.dot(embeddings, query_embedding)
for doc, sim in zip(documents, similarities):
    print(f'{sim:.3f}: {doc}')

Build a Semantic Search Index

Create a searchable vector database with FAISS

Install: pip install sentence-transformers faiss-cpu numpy
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load model
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

# Your documents
documents = [
    'Python is a programming language',
    'JavaScript runs in the browser',
    'Machine learning uses neural networks',
    'Databases store information',
    'APIs enable communication between services'
]

# Create embeddings
embeddings = model.encode(documents, normalize_embeddings=True)
embeddings = np.array(embeddings).astype('float32')

# Build a FAISS inner-product index; on normalized embeddings, inner product equals cosine similarity
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Search
query = 'how to store data'
query_vec = model.encode([query], normalize_embeddings=True).astype('float32')

D, I = index.search(query_vec, k=3)
print('Top 3 results:')
for score, idx in zip(D[0], I[0]):
    print(f'  {score:.3f}: {documents[idx]}')

Quick Facts

  • Input: Text
  • Output: Vector
  • Implementations: 3 open source, 3 API
  • Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for text embedding.

Submit Results