
Cross-Encoder Reranking

Re-score retrieved passages with a cross-encoder to boost search precision.

How Text Reranking Works

Why fast retrieval is not enough, and how reranking fixes the precision problem.

The Speed vs Accuracy Tradeoff

Vector search (bi-encoders) is fast because it pre-computes document embeddings. But this speed comes at a cost: the model never sees your query and document together. It can only compare their embeddings, missing nuanced semantic matches.

[Diagram: bi-encoder approach. The query and the document are embedded separately, then their embeddings are compared with cosine similarity.]
The Reranking Solution

Cross-encoders see the query and document together in one forward pass. This allows the model to understand the relationship between them directly, catching subtle relevance signals that embeddings miss.

[Diagram: cross-encoder approach. The query and document are concatenated as [Query] [SEP] [Document] and scored with joint attention over both texts.]
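As a concrete sketch, the snippet below scores query-document pairs jointly with the sentence-transformers CrossEncoder class; the checkpoint name is just one commonly used public example, not a requirement.

    # Joint scoring sketch: each (query, document) pair goes through the model
    # together, so the model attends across both texts instead of comparing
    # frozen embeddings. Assumes the sentence-transformers package.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint

    query = "How do I use Python for machine learning?"
    docs = [
        "Machine learning in Python uses libraries like scikit-learn, TensorFlow, and PyTorch.",
        "The python snake is a large non-venomous reptile found in Africa, Asia, and Australia.",
    ]

    scores = model.predict([(query, doc) for doc in docs])
    print(scores)  # higher score = more relevant to the query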
The Key Insight
Reranking is a two-stage process: first, cast a wide net with fast bi-encoder retrieval (fetch 100+ candidates). Then, use a slower but more accurate cross-encoder to rerank just the top candidates. You get the speed of vector search with the precision of joint encoding.
1. Bi-Encoder vs Cross-Encoder: The Architecture Difference

These two architectures make fundamentally different tradeoffs. Understanding when to use each is key.

Bi-Encoder (for retrieval)
[Diagram: the query and the document each pass through the encoder separately, producing embeddings such as [0.2, -0.5, ...] and [0.3, -0.4, ...] that are compared with cosine similarity (e.g., 0.82).]
  + Pre-compute doc embeddings (fast at query time)
  + Scales to millions of documents
  - No cross-attention between query and doc
Cross-Encoder (for reranking)
[Diagram: the query and document are fed through the model together as Query [SEP] Document; query tokens attend to document tokens via cross-attention, producing a single relevance score (e.g., 0.92).]
  + Full attention between query and document
  + Much more accurate relevance scores
  - Must run inference for each query-doc pair
Computational Complexity
Bi-Encoder: O(1) model calls per query (after indexing)
Cross-Encoder: O(n) model calls per query, where n = number of documents to score
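To make the O(1)-after-indexing point concrete, here is a minimal sketch of the bi-encoder side, assuming sentence-transformers (the checkpoint name is just a common example): the corpus is embedded once up front, and each query costs a single embedding plus a vector search.

    # Bi-encoder sketch: pre-compute document embeddings once, then answer each
    # query with one forward pass plus a similarity search.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

    corpus = [
        "Machine learning in Python uses libraries like scikit-learn, TensorFlow, and PyTorch.",
        "The python snake is a large non-venomous reptile found in Africa, Asia, and Australia.",
        "Python web frameworks like Django and Flask make building applications faster.",
    ]

    # One-time indexing cost: embed and store the whole corpus.
    corpus_embeddings = encoder.encode(corpus, convert_to_tensor=True)

    # Per-query cost: one embedding + nearest-neighbour search.
    query_embedding = encoder.encode("How do I use Python for machine learning?", convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
    for hit in hits:
        print(round(hit["score"], 2), corpus[hit["corpus_id"]])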
2. Interactive: Watch Reranking in Action

See how the ranking changes when we apply a cross-encoder to the bi-encoder results.

QUERY
How do I use Python for machine learning?
Bi-encoder results (ranked by embedding similarity):
  1. 0.82  Python is a high-level programming language known for its readability and simplicity. (Relevant, topic: programming)
  2. 0.79  The python snake is a large non-venomous reptile found in Africa, Asia, and Australia. (Not relevant, topic: animals)
  3. 0.76  Python web frameworks like Django and Flask make building applications faster. (Relevant, topic: programming)
  4. 0.73  Machine learning in Python uses libraries like scikit-learn, TensorFlow, and PyTorch. (Relevant, topic: programming)
  5. 0.71  Ball pythons are popular pets because they are docile and easy to care for. (Not relevant, topic: animals)
  6. 0.68  Python data analysis relies heavily on pandas, numpy, and matplotlib. (Relevant, topic: programming)
Initial retrieval: The bi-encoder ranks documents by embedding similarity. Notice how snake-related documents sneak in because "Python" the language and "python" the snake have similar embeddings. The model cannot distinguish context.
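The demo above maps to a few lines of code. A sketch using sentence-transformers (example checkpoint; real scores will differ from the illustrative numbers shown):

    # Rerank the six demo documents with a cross-encoder and print the new order.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint

    query = "How do I use Python for machine learning?"
    candidates = [
        "Python is a high-level programming language known for its readability and simplicity.",
        "The python snake is a large non-venomous reptile found in Africa, Asia, and Australia.",
        "Python web frameworks like Django and Flask make building applications faster.",
        "Machine learning in Python uses libraries like scikit-learn, TensorFlow, and PyTorch.",
        "Ball pythons are popular pets because they are docile and easy to care for.",
        "Python data analysis relies heavily on pandas, numpy, and matplotlib.",
    ]

    # Score every (query, document) pair jointly and sort by the new scores.
    scores = reranker.predict([(query, doc) for doc in candidates])
    for score, doc in sorted(zip(scores, candidates), reverse=True):
        print(f"{score:.2f}  {doc}")
    # The machine-learning document should rise to the top and the snake
    # documents should drop, because the model sees query and document together.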
3. The Two-Stage Retrieval Pipeline

Reranking is not a replacement for vector search - it is a refinement layer on top.

1. Query: the user's question
2. Retrieve: fast vector search (bi-encoder)
3. Rerank: precise scoring (cross-encoder)
4. Return: top results to the LLM or user
Stage 1: Retrieve
  • Use bi-encoder (e.g., E5, BGE, OpenAI embeddings)
  • Fetch top 50-200 candidates from vector DB
  • Latency: ~10-50ms for millions of docs
  • Goal: High recall (don't miss relevant docs)
Stage 2: Rerank
  • Use cross-encoder on retrieved candidates
  • Score each query-doc pair independently
  • Latency: ~20-100ms for 50 docs
  • Goal: High precision (rank best docs first)
Stage 3: Return
  • Take top-k reranked results (k = 3-10)
  • Pass to LLM as context for RAG
  • Or display directly to user in search
  • Total latency: ~50-200ms end-to-end
Why not just use cross-encoder for retrieval?
Cross-encoders require a forward pass for every query-document pair. With 1 million documents, that is 1 million inference calls per query. At ~30ms each, that is 8+ hours per query. Bi-encoders pre-compute document embeddings, so query time is just one embedding + vector search.
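Putting the three stages together, here is a hedged end-to-end sketch, again assuming sentence-transformers and example checkpoints; the search function and its parameters are illustrative, not a fixed API.

    # Two-stage pipeline sketch: retrieve wide with a bi-encoder, rerank narrow
    # with a cross-encoder, return a small top-k.
    from sentence_transformers import CrossEncoder, SentenceTransformer, util

    retriever = SentenceTransformer("all-MiniLM-L6-v2")                  # Stage 1
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")      # Stage 2

    def search(query, corpus, corpus_embeddings, retrieve_k=100, return_k=5):
        # Stage 1: fast, high-recall retrieval over pre-computed embeddings.
        query_embedding = retriever.encode(query, convert_to_tensor=True)
        top_k = min(retrieve_k, len(corpus))
        hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)[0]
        candidates = [corpus[hit["corpus_id"]] for hit in hits]

        # Stage 2: precise scoring of the candidates only.
        scores = reranker.predict([(query, doc) for doc in candidates])
        reranked = sorted(zip(scores, candidates), reverse=True)

        # Stage 3: hand the top-k to the LLM prompt or the search results page.
        return reranked[:return_k]

    corpus = ["...your documents..."]  # placeholder corpus
    corpus_embeddings = retriever.encode(corpus, convert_to_tensor=True)
    print(search("How do I use Python for machine learning?", corpus, corpus_embeddings))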
4. Score Calibration: Making Scores Meaningful

Raw model scores are not probabilities. Calibration transforms them into interpretable relevance scores.

The Problem with Raw Scores
  • Model A score: 2.34 (what does this mean?)
  • Model B score: -0.82 (is this relevant or not?)
  • Cosine similarity: 0.73 (a good threshold varies by model)

Different models use different score scales. A score of 0.7 might be excellent for one model but mediocre for another. Cross-encoders often output logits that need sigmoid normalization.

Calibration Techniques
  • Sigmoid Normalization: map logits to a 0-1 probability range
      score = 1 / (1 + exp(-logit))
  • Temperature Scaling: sharpen or soften the score distribution
      score = softmax(logits / temperature)
  • Min-Max Normalization: scale to 0-1 based on the batch
      score = (x - min) / (max - min)
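A small sketch of the three techniques applied to raw cross-encoder logits (the numbers are illustrative only):

    # Calibration sketch: turn raw logits into 0-1 scores three different ways.
    import numpy as np

    logits = np.array([2.34, -0.82, 0.15])  # illustrative raw model outputs

    # Sigmoid normalization: per-score mapping to a 0-1 "probability of relevance".
    sigmoid_scores = 1.0 / (1.0 + np.exp(-logits))

    # Temperature scaling: divide by T before softmax; T > 1 softens the
    # distribution, T < 1 sharpens it.
    temperature = 2.0
    scaled = np.exp(logits / temperature)
    softmax_scores = scaled / scaled.sum()

    # Min-max normalization: rescale relative to the current batch of scores.
    minmax_scores = (logits - logits.min()) / (logits.max() - logits.min())

    print(sigmoid_scores, softmax_scores, minmax_scores)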
Setting Relevance Thresholds

After calibration, you can set meaningful thresholds. Common approach: evaluate on a labeled dataset to find the threshold that maximizes F1 score or achieves desired precision/recall.

  • Score >= 0.7: highly relevant
  • Score 0.4-0.7: possibly relevant
  • Score < 0.4: not relevant
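If you have a labeled evaluation set, the threshold search described above takes a few lines with scikit-learn; the labels and scores below are hypothetical placeholders.

    # Pick the relevance cutoff that maximizes F1 on labeled dev data.
    import numpy as np
    from sklearn.metrics import precision_recall_curve

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])    # 1 = relevant, 0 = not (hypothetical)
    y_score = np.array([0.91, 0.35, 0.78, 0.64, 0.52, 0.12, 0.83, 0.44])  # calibrated scores

    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-9)
    best = thresholds[f1.argmax()]
    print(f"Use score >= {best:.2f} as the relevance cutoff (F1 = {f1.max():.2f})")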
5. Reranking Methods Compared

From managed APIs to self-hosted models, choose based on your latency, accuracy, and cost requirements.


Common Pitfalls

1. Reranking too few candidates
If your initial retrieval only fetches 10 documents and the best one is at position 15, reranking will not help. Retrieve more candidates than you need.
Fix: Retrieve 50-200 candidates, rerank to top 5-10.
2. Ignoring document length
Cross-encoders have max sequence lengths (often 512 tokens). Long documents get truncated, potentially cutting off the relevant part.
Fix: Chunk documents, rerank chunks, or use models with longer context.
3. Mismatched training domains
Most rerankers are trained on web search or MS MARCO data. If your domain is very different (e.g., medical, legal), out-of-the-box performance may suffer.
Fix: Fine-tune on your domain or use a model trained on similar data.
4. Not batching inference
Running cross-encoder inference one document at a time is slow, and GPU utilization tanks.
Fix: Batch all query-document pairs and run inference together (see the sketch below).
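A sketch of the batching fix, assuming sentence-transformers (whose predict method accepts a batch_size argument; the placeholder documents are hypothetical):

    # Batched reranking: score all pairs in one predict() call instead of looping.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint
    query = "How do I use Python for machine learning?"
    candidates = [f"candidate document {i}" for i in range(200)]     # placeholder docs

    # Slow anti-pattern: one forward pass per document.
    # scores = [reranker.predict([(query, doc)])[0] for doc in candidates]

    # Batched: the library packs pairs into fixed-size batches internally.
    scores = reranker.predict([(query, doc) for doc in candidates], batch_size=32)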

The Complete Picture

Query --> Bi-Encoder (fast recall) --> 50-200 candidates --> Cross-Encoder (precise) --> Top 3-10 results

Text reranking solves the precision problem in retrieval systems. Bi-encoders are fast but approximate. Cross-encoders are slow but precise. By using both in a two-stage pipeline, you get the best of both worlds: sub-second latency with state-of-the-art relevance ranking. Whether you use Cohere's API, BGE Reranker, or a fine-tuned cross-encoder, the key insight remains the same: joint attention between query and document captures semantic relationships that embedding comparison cannot.

Use Cases

  • RAG retrieval quality
  • E-commerce search
  • Legal/medical search
  • Recommendations

Architectural Patterns

Bi-encoder + Cross-encoder

Retrieve many with dense vectors, rerank top-k with cross-encoder.

Late Interaction

Efficient token-level scoring (ColBERT-style).

Implementations

API Services

  • Cohere Rerank (Cohere, API): state-of-the-art reranker for English and multilingual queries.

Open Source

  • BGE-Reranker (MIT): strong open reranker for RAG.
  • monoT5 (Apache 2.0): text-to-text reranker trained on MS MARCO.

Benchmarks

Quick Facts

Input: Text
Output: Structured Data
Implementations: 2 open source, 1 API
Patterns: 2 approaches

Have benchmark data?

Help us track the state of the art for cross-encoder reranking.

Submit Results