
Cross-Encoder Reranking

Re-score retrieved passages with a cross-encoder to boost search precision.

How Text Reranking Works

Why fast retrieval is not enough, and how reranking fixes the precision problem.

The Speed vs Accuracy Tradeoff

Vector search (bi-encoders) is fast because it pre-computes document embeddings. But this speed comes at a cost: the model never sees your query and document together. It can only compare their embeddings, missing nuanced semantic matches.

[Diagram: bi-encoder approach. The query and the document are embedded separately, then their embeddings are compared with cosine similarity.]
The Reranking Solution

Cross-encoders see the query and document together in one forward pass. This allows the model to understand the relationship between them directly, catching subtle relevance signals that embeddings miss.

[Diagram: cross-encoder approach. The query and document are concatenated as [Query] [SEP] [Document] and scored with joint attention over both texts.]
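As a concrete sketch, the snippet below scores query-document pairs jointly with the sentence-transformers CrossEncoder class; the checkpoint name is just one commonly used public example, not a requirement.

    # Joint scoring sketch: each (query, document) pair goes through the model
    # together, so the model attends across both texts instead of comparing
    # frozen embeddings. Assumes the sentence-transformers package.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint

    query = "How do I use Python for machine learning?"
    docs = [
        "Machine learning in Python uses libraries like scikit-learn, TensorFlow, and PyTorch.",
        "The python snake is a large non-venomous reptile found in Africa, Asia, and Australia.",
    ]

    scores = model.predict([(query, doc) for doc in docs])
    print(scores)  # higher score = more relevant to the query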
The Key Insight
Reranking is a two-stage process: first, cast a wide net with fast bi-encoder retrieval (fetch 100+ candidates). Then, use a slower but more accurate cross-encoder to rerank just the top candidates. You get the speed of vector search with the precision of joint encoding.
1. Bi-Encoder vs Cross-Encoder: The Architecture Difference

These two architectures make fundamentally different tradeoffs. Understanding when to use each is key.

Bi-Encoder (for retrieval)
[Diagram: the query and the document each pass through the encoder separately, producing embeddings such as [0.2, -0.5, ...] and [0.3, -0.4, ...] that are compared with cosine similarity (e.g., 0.82).]
  + Pre-compute doc embeddings (fast at query time)
  + Scales to millions of documents
  - No cross-attention between query and doc
Cross-Encoder (for reranking)
[Diagram: the query and document are fed through the model together as Query [SEP] Document; query tokens attend to document tokens via cross-attention, producing a single relevance score (e.g., 0.92).]
  + Full attention between query and document
  + Much more accurate relevance scores
  - Must run inference for each query-doc pair
Computational Complexity
Bi-Encoder: O(1) model calls per query (after indexing)
Cross-Encoder: O(n) model calls per query, where n = number of documents to score
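To make the O(1)-after-indexing point concrete, here is a minimal sketch of the bi-encoder side, assuming sentence-transformers (the checkpoint name is just a common example): the corpus is embedded once up front, and each query costs a single embedding plus a vector search.

    # Bi-encoder sketch: pre-compute document embeddings once, then answer each
    # query with one forward pass plus a similarity search.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

    corpus = [
        "Machine learning in Python uses libraries like scikit-learn, TensorFlow, and PyTorch.",
        "The python snake is a large non-venomous reptile found in Africa, Asia, and Australia.",
        "Python web frameworks like Django and Flask make building applications faster.",
    ]

    # One-time indexing cost: embed and store the whole corpus.
    corpus_embeddings = encoder.encode(corpus, convert_to_tensor=True)

    # Per-query cost: one embedding + nearest-neighbour search.
    query_embedding = encoder.encode("How do I use Python for machine learning?", convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
    for hit in hits:
        print(round(hit["score"], 2), corpus[hit["corpus_id"]])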
2. Interactive: Watch Reranking in Action

See how the ranking changes when we apply a cross-encoder to the bi-encoder results.

QUERY
How do I use Python for machine learning?
Bi-encoder results (ranked by embedding similarity):
  1. 0.82  Python is a high-level programming language known for its readability and simplicity. (Relevant, topic: programming)
  2. 0.79  The python snake is a large non-venomous reptile found in Africa, Asia, and Australia. (Not relevant, topic: animals)
  3. 0.76  Python web frameworks like Django and Flask make building applications faster. (Relevant, topic: programming)
  4. 0.73  Machine learning in Python uses libraries like scikit-learn, TensorFlow, and PyTorch. (Relevant, topic: programming)
  5. 0.71  Ball pythons are popular pets because they are docile and easy to care for. (Not relevant, topic: animals)
  6. 0.68  Python data analysis relies heavily on pandas, numpy, and matplotlib. (Relevant, topic: programming)
Initial retrieval: The bi-encoder ranks documents by embedding similarity. Notice how snake-related documents sneak in because "Python" the language and "python" the snake have similar embeddings. The model cannot distinguish context.
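The demo above maps to a few lines of code. A sketch using sentence-transformers (example checkpoint; real scores will differ from the illustrative numbers shown):

    # Rerank the six demo documents with a cross-encoder and print the new order.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint

    query = "How do I use Python for machine learning?"
    candidates = [
        "Python is a high-level programming language known for its readability and simplicity.",
        "The python snake is a large non-venomous reptile found in Africa, Asia, and Australia.",
        "Python web frameworks like Django and Flask make building applications faster.",
        "Machine learning in Python uses libraries like scikit-learn, TensorFlow, and PyTorch.",
        "Ball pythons are popular pets because they are docile and easy to care for.",
        "Python data analysis relies heavily on pandas, numpy, and matplotlib.",
    ]

    # Score every (query, document) pair jointly and sort by the new scores.
    scores = reranker.predict([(query, doc) for doc in candidates])
    for score, doc in sorted(zip(scores, candidates), reverse=True):
        print(f"{score:.2f}  {doc}")
    # The machine-learning document should rise to the top and the snake
    # documents should drop, because the model sees query and document together.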
3. The Two-Stage Retrieval Pipeline

Reranking is not a replacement for vector search - it is a refinement layer on top.

1. Query: the user's question
2. Retrieve: fast vector search (bi-encoder)
3. Rerank: precise scoring (cross-encoder)
4. Return: top results to the LLM or user
Stage 1: Retrieve
  • Use bi-encoder (e.g., E5, BGE, OpenAI embeddings)
  • Fetch top 50-200 candidates from vector DB
  • Latency: ~10-50ms for millions of docs
  • Goal: High recall (don't miss relevant docs)
Stage 2: Rerank
  • Use cross-encoder on retrieved candidates
  • Score each query-doc pair independently
  • Latency: ~20-100ms for 50 docs
  • Goal: High precision (rank best docs first)
Stage 3: Return
  • Take top-k reranked results (k = 3-10)
  • Pass to LLM as context for RAG
  • Or display directly to user in search
  • Total latency: ~50-200ms end-to-end
Why not just use cross-encoder for retrieval?
Cross-encoders require a forward pass for every query-document pair. With 1 million documents, that is 1 million inference calls per query. At ~30ms each, that is 8+ hours per query. Bi-encoders pre-compute document embeddings, so query time is just one embedding + vector search.
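Putting the three stages together, here is a hedged end-to-end sketch, again assuming sentence-transformers and example checkpoints; the search function and its parameters are illustrative, not a fixed API.

    # Two-stage pipeline sketch: retrieve wide with a bi-encoder, rerank narrow
    # with a cross-encoder, return a small top-k.
    from sentence_transformers import CrossEncoder, SentenceTransformer, util

    retriever = SentenceTransformer("all-MiniLM-L6-v2")                  # Stage 1
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")      # Stage 2

    def search(query, corpus, corpus_embeddings, retrieve_k=100, return_k=5):
        # Stage 1: fast, high-recall retrieval over pre-computed embeddings.
        query_embedding = retriever.encode(query, convert_to_tensor=True)
        top_k = min(retrieve_k, len(corpus))
        hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
        candidates = [corpus[hit["corpus_id"]] for hit in hits]

        # Stage 2: precise scoring of the candidates only.
        scores = reranker.predict([(query, doc) for doc in candidates])
        reranked = sorted(zip(scores, candidates), reverse=True)

        # Stage 3: hand the top-k to the LLM prompt or the search results page.
        return reranked[:return_k]

    corpus = ["...your documents..."]  # placeholder corpus
    corpus_embeddings = retriever.encode(corpus, convert_to_tensor=True)
    print(search("How do I use Python for machine learning?", corpus, corpus_embeddings))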
4. Score Calibration: Making Scores Meaningful

Raw model scores are not probabilities. Calibration transforms them into interpretable relevance scores.

The Problem with Raw Scores
  • Model A score: 2.34 (what does this mean?)
  • Model B score: -0.82 (is this relevant or not?)
  • Cosine similarity: 0.73 (a good threshold varies by model)

Different models use different score scales. A score of 0.7 might be excellent for one model but mediocre for another. Cross-encoders often output logits that need sigmoid normalization.

Calibration Techniques
  • Sigmoid Normalization: map logits to a 0-1 probability range
      score = 1 / (1 + exp(-logit))
  • Temperature Scaling: sharpen or soften the score distribution
      score = softmax(logits / temperature)
  • Min-Max Normalization: scale to 0-1 based on the batch
      score = (x - min) / (max - min)
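A small sketch of the three techniques applied to raw cross-encoder logits (the numbers are illustrative only):

    # Calibration sketch: turn raw logits into 0-1 scores three different ways.
    import numpy as np

    logits = np.array([2.34, -0.82, 0.15])  # illustrative raw model outputs

    # Sigmoid normalization: per-score mapping to a 0-1 "probability of relevance".
    sigmoid_scores = 1.0 / (1.0 + np.exp(-logits))

    # Temperature scaling: divide by T before softmax; T > 1 softens the
    # distribution, T < 1 sharpens it.
    temperature = 2.0
    scaled = np.exp(logits / temperature)
    softmax_scores = scaled / scaled.sum()

    # Min-max normalization: rescale relative to the current batch of scores.
    minmax_scores = (logits - logits.min()) / (logits.max() - logits.min())

    print(sigmoid_scores, softmax_scores, minmax_scores)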
Setting Relevance Thresholds

After calibration, you can set meaningful thresholds. Common approach: evaluate on a labeled dataset to find the threshold that maximizes F1 score or achieves desired precision/recall.

  • Score >= 0.7: highly relevant
  • Score 0.4-0.7: possibly relevant
  • Score < 0.4: not relevant
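If you have a labeled evaluation set, the threshold search described above takes a few lines with scikit-learn; the labels and scores below are hypothetical placeholders.

    # Pick the relevance cutoff that maximizes F1 on labeled dev data.
    import numpy as np
    from sklearn.metrics import precision_recall_curve

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])    # 1 = relevant, 0 = not (hypothetical)
    y_score = np.array([0.91, 0.35, 0.78, 0.64, 0.52, 0.12, 0.83, 0.44])  # calibrated scores

    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-9)
    best = thresholds[f1.argmax()]
    print(f"Use score >= {best:.2f} as the relevance cutoff (F1 = {f1.max():.2f})")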
5. Reranking Methods Compared

From managed APIs to self-hosted models, choose based on your latency, accuracy, and cost requirements.


Common Pitfalls

1. Reranking too few candidates
If your initial retrieval only fetches 10 documents and the best one is at position 15, reranking will not help. Retrieve more candidates than you need.
Fix: Retrieve 50-200 candidates, rerank to top 5-10.
2. Ignoring document length
Cross-encoders have max sequence lengths (often 512 tokens). Long documents get truncated, potentially cutting off the relevant part.
Fix: Chunk documents, rerank chunks, or use models with longer context.
3. Mismatched training domains
Most rerankers are trained on web search or MS MARCO data. If your domain is very different (e.g., medical, legal), out-of-the-box performance may suffer.
Fix: Fine-tune on your domain or use a model trained on similar data.
4. Not batching inference
Running cross-encoder inference one document at a time is slow, and GPU utilization tanks.
Fix: Batch all query-document pairs and run inference together (see the sketch below).
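A sketch of the batching fix, assuming sentence-transformers (whose predict method accepts a batch_size argument; the placeholder documents are hypothetical):

    # Batched reranking: score all pairs in one predict() call instead of looping.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint
    query = "How do I use Python for machine learning?"
    candidates = [f"candidate document {i}" for i in range(200)]     # placeholder docs

    # Slow anti-pattern: one forward pass per document.
    # scores = [reranker.predict([(query, doc)])[0] for doc in candidates]

    # Batched: the library packs pairs into fixed-size batches internally.
    scores = reranker.predict([(query, doc) for doc in candidates], batch_size=32)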

The Complete Picture

Query --> Bi-Encoder (fast recall) --> 50-200 candidates --> Cross-Encoder (precise) --> Top 3-10 results

Text reranking solves the precision problem in retrieval systems. Bi-encoders are fast but approximate. Cross-encoders are slow but precise. By using both in a two-stage pipeline, you get the best of both worlds: sub-second latency with state-of-the-art relevance ranking. Whether you use Cohere's API, BGE Reranker, or a fine-tuned cross-encoder, the key insight remains the same: joint attention between query and document captures semantic relationships that embedding comparison cannot.

Use Cases

  • RAG retrieval quality
  • E-commerce search
  • Legal/medical search
  • Recommendations

Architectural Patterns

Bi-encoder + Cross-encoder

Retrieve many with dense vectors, rerank top-k with cross-encoder.

Late Interaction

Efficient token-level scoring (ColBERT-style).

Implementations

API Services

  • Cohere Rerank (Cohere, API): state-of-the-art reranker for English and multilingual queries.

Open Source

  • BGE-Reranker (MIT): strong open reranker for RAG.
  • monoT5 (Apache 2.0): text-to-text reranker trained on MS MARCO.

Benchmarks

Quick Facts

Input: Text
Output: Structured Data
Implementations: 2 open source, 1 API
Patterns: 2 approaches

Have benchmark data?

Help us track the state of the art for cross-encoder reranking.

Submit Results