IRPAPERS: Text vs Image Retrieval on Scientific Documents
A new benchmark of 3,230 pages from 166 papers reveals that text and image retrieval have complementary failures — and combining them pushes Recall@20 to 95%.
TL;DR
- Cohere Embed v4 leads retrieval at 58% Recall@1, 97% Recall@20
- Multimodal hybrid search outperforms either modality alone (49% R@1, 95% R@20)
- Text RAG beats image RAG on QA alignment: 0.82 vs 0.71
- Complementary failures — text loses tables/figures, images lose dense prose
- Dataset: 3,230 pages, 166 papers, 180 needle-in-the-haystack questions
What is IRPAPERS?
Most RAG benchmarks assume text-only input. But scientific papers contain tables, equations, figures, and complex layouts that OCR often mangles. IRPAPERS is the first benchmark designed to systematically compare image-based vs text-based retrieval and QA on real scientific documents.
Created by the Weaviate team (Connor Shorten, Bob van Luijt, et al.), the benchmark provides dual representations of every page — a document image and an OCR transcription — enabling controlled experiments across modalities.
How It Works: Two Pipelines, One Question
The same question is answered through two parallel pipelines. Each page exists as both an image and a text transcription. The benchmark measures which representation retrieves the right page and produces the right answer.
Retrieval Performance
Recall at different cutoffs across retrieval methods. Cohere Embed v4 dominates, but the open-source multimodal hybrid is competitive — and significantly outperforms either single modality.
Key insight: Image retrieval matches text at Recall@5 (both 78%) and surpasses it at Recall@20 (93% vs 91%). The modalities are genuinely complementary — combining them yields the best results at every cutoff.
Embedding Models Compared
Five embedding approaches tested on the IRPAPERS corpus, spanning closed-source multimodal, open-source text, and open-source vision models.
| Model | Type | Modality | R@1 | R@20 | Notes |
|---|---|---|---|---|---|
| Cohere Embed v4 | Closed-source | Multimodal | 58% | 97% | Best overall. Native multimodal embeddings. |
| Arctic 2.0 | Open-source | Text | 46% | 91% | Best open-source text embeddings. Used with BM25 hybrid. |
| Voyage 3 Large | Closed-source | Text | 44% | 90% | Strong text retrieval, competitive with Arctic. |
| ColPali / ColQwen | Open-source | Image | 43% | 93% | Multi-vector image embeddings. Strong at high recall. |
| MUVERA | Open-source | Image | 39% | 88% | Single-vector approximation of multi-vector. Faster inference. |
MUVERA provides a useful efficiency tradeoff — a single-vector approximation of multi-vector ColPali embeddings, trading about 4 points of Recall@1 for significantly faster inference.
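To see why a single-vector approximation can trade recall for speed, here is a toy sketch — not the actual MUVERA fixed-dimensional encoding — contrasting late-interaction MaxSim scoring with a mean-pooled single-vector dot product over random stand-in embeddings:

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    # Late interaction (ColPali-style): for each query token vector, take
    # the best-matching document patch vector, then sum over query tokens.
    sims = query_vecs @ doc_vecs.T  # (n_query_tokens, n_doc_patches)
    return sims.max(axis=1).sum()

def pooled_score(query_vecs, doc_vecs):
    # Single-vector shortcut: mean-pool each side to one vector, then
    # score with a single dot product — one comparison instead of many.
    return float(np.mean(query_vecs, axis=0) @ np.mean(doc_vecs, axis=0))

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))    # 8 query token embeddings (stand-ins)
d1 = rng.normal(size=(32, 128))  # page with 32 patch embeddings
d2 = rng.normal(size=(32, 128))

print(maxsim(q, d1), maxsim(q, d2))
print(pooled_score(q, d1), pooled_score(q, d2))
```

The pooled score costs one dot product per page instead of a full token-by-patch similarity matrix, which is the shape of the speed/recall tradeoff the benchmark measures (MUVERA itself uses a much more faithful encoding than mean pooling).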
Question Answering Performance
Beyond retrieval, the benchmark measures end-to-end QA quality by comparing generated answers against ground truth. Text-based RAG leads here, with a meaningful 15.5% relative advantage (0.82 vs 0.71 answer alignment).
- Text pipeline (0.82 alignment): OCR transcription + text embeddings + LLM
- Image pipeline (0.71 alignment): page images + vision embeddings + VLM
Why text leads QA: LLMs are better at reasoning over text than vision models are at reasoning over images. Even when image retrieval finds the right page, the VLM often struggles to extract the precise answer from the visual layout.
Complementary Failures
The paper's most important finding: text and image retrieval fail on different questions. This is why hybrid search works — each modality covers the other's blind spots.
- Text retrieval blind spots: OCR loses spatial layout and flattens structure; text extraction merges multi-column text incorrectly
- Image retrieval blind spots: vision models struggle with long text spans; page-level resolution limits what is legible
- Shared blind spots: neither modality links content across pages, and some questions require domain understanding beyond the page
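A toy illustration of complementarity (made-up per-question hit data, not the paper's numbers): when the two modalities fail on different questions, the union recovers more than either alone.

```python
# Which questions each modality retrieved correctly (illustrative only)
text_hits  = {1: True, 2: True, 3: False, 4: False, 5: True}
image_hits = {1: True, 2: False, 3: True, 4: False, 5: True}

def recall(hits):
    # Fraction of questions answered by this modality
    return sum(hits.values()) / len(hits)

# Hybrid coverage: a question is covered if either modality finds it
union = {q: text_hits[q] or image_hits[q] for q in text_hits}

print(recall(text_hits), recall(image_hits), recall(union))  # 0.6 0.6 0.8
```

If the failures overlapped completely, the union would equal the single-modality recall; the benchmark's +2-4pp hybrid gain shows they largely do not.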
Which Modality Wins by Question Type?
Different question types favor different retrieval strategies. Understanding this mapping helps you choose the right approach for your use case.
| Question Type | Example | Best Modality | Why |
|---|---|---|---|
| Factual Extraction | What learning rate was used for fine-tuning? | Text | OCR captures exact numbers reliably |
| Table/Figure Reading | Which model achieved the highest F1 in Table 3? | Image | Layout structure preserved in images |
| Cross-reference | What does the ablation in Section 4.2 show about component X? | Hybrid | Requires both text understanding and layout context |
| Equation Understanding | What is the loss function defined in Equation 5? | Image | LaTeX rendering lost in OCR transcription |
| Methodology Details | How many layers does the encoder have? | Text | Buried in prose, text search excels |
| Visual Comparison | Which architecture diagram shows skip connections? | Image | Diagrams are purely visual content |
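One way to act on this mapping is a simple heuristic router that picks a modality from surface cues in the question. The keyword lists below are illustrative guesses, not from the benchmark:

```python
def route_question(question: str) -> str:
    # Hypothetical keyword router — cues chosen to mirror the table above,
    # not learned from data.
    q = question.lower()
    hybrid_cues = ("section", "ablation", "compare across")
    image_cues = ("table", "figure", "diagram", "equation", "plot", "chart")
    if any(cue in q for cue in hybrid_cues):
        return "hybrid"
    if any(cue in q for cue in image_cues):
        return "image"
    return "text"  # default: factual extraction from prose

print(route_question("Which model achieved the highest F1 in Table 3?"))
print(route_question("What learning rate was used for fine-tuning?"))
```

A production system would more likely route with a classifier or simply run hybrid retrieval everywhere, but a cheap router like this captures the table's intuition.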
Practical Implications for RAG Systems
If you're building document RAG
Don't rely on text-only. Scientific documents lose 13% of retrievable information through OCR. Store both page images and text, and fuse retrieval results. The +4pp hybrid boost over text-only at R@20 (95% vs 91%) compounds across large corpora.
If you need precision (R@1)
Use Cohere Embed v4 if budget allows — it's 12 percentage points ahead of the best open-source option at R@1 (58% vs 46%). For open-source, multimodal hybrid narrows the gap to 49%.
If you need recall (R@20)
Image retrieval actually beats text at R@20 (93% vs 91%). ColPali/ColQwen embeddings capture layout information that text misses. For reranking pipelines, start with image retrieval for the candidate set.
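A minimal sketch of that retrieve-then-rerank shape, with stub functions standing in for a real image retriever and reranker (both hypothetical here):

```python
def two_stage(query, image_search, rerank, candidates=20, final=5):
    # Stage 1: image retrieval casts a wide net (93% R@20 on IRPAPERS)
    pool = image_search(query, k=candidates)
    # Stage 2: rerank the small pool with a stronger, slower scorer
    return rerank(query, pool)[:final]

# Stub components for illustration only
def image_search(query, k):
    return [f"page_{i}" for i in range(k)]

def rerank(query, pages):
    # Pretend scorer: rank by page number, descending
    return sorted(pages, key=lambda p: int(p.split("_")[1]), reverse=True)

print(two_stage("architecture diagram with skip connections", image_search, rerank))
```

The design point: pay the expensive reranker only on the top-k candidate set, where the high-recall first stage has almost certainly already included the gold page.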
If you need answer quality
Text-based RAG still produces better answers (0.82 vs 0.71 alignment). Even with perfect image retrieval, current VLMs lag behind text LLMs at extracting precise information from visual layouts.
The Retrieval Depth Effect
One of the paper's clearest findings: more retrieval context helps both modalities significantly. Going from Recall@1 to Recall@20, text improves by 45 percentage points and images by 50.
- Text: +45pp from R@1 to R@20 (46% → 78% → 91%). BM25 keyword matching rescues queries that embedding similarity misses.
- Image: +50pp from R@1 to R@20 (43% → 78% → 93%). Multi-vector embeddings (ColPali) have high per-query variance; more candidates smooth this out.
- Hybrid: +46pp from R@1 to R@20 (49% → 81% → 95%). Returns diminish more slowly for hybrid: each additional result is more likely to be genuinely relevant.
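The Recall@k numbers above are computed per question. With exactly one gold page per question (the needle-in-the-haystack setup), the metric reduces to a simple hit test, sketched here:

```python
def recall_at_k(ranked_ids, gold_id, k):
    # One gold page per question: recall@k is simply whether the gold
    # page appears anywhere in the top-k results.
    return int(gold_id in ranked_ids[:k])

def mean_recall_at_k(runs, k):
    # Average the per-question hit rate; each run pairs a ranked list
    # of page ids with that question's single gold page id.
    return sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)

runs = [
    (["p3", "p7", "p1"], "p7"),  # hit at rank 2
    (["p9", "p2", "p4"], "p5"),  # miss
]
print(mean_recall_at_k(runs, 1))  # 0.0
print(mean_recall_at_k(runs, 2))  # 0.5
```

This is why deeper cutoffs help so much here: a question counts as recovered the moment the gold page enters the candidate window, even at rank 20.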
About the Dataset
- Needle-in-the-haystack design: Each of the 180 questions targets a specific page in a specific paper. The system must retrieve the exact right page from 3,230 candidates — a realistic scenario for research assistants.
- Dual representation: Every page exists as both a rendered image (as you'd see in a PDF viewer) and an OCR text transcription. This controlled setup isolates the effect of modality from other variables.
- Open access: The IRPAPERS dataset and all experimental code are publicly available under CC BY 4.0, licensed for research and commercial use.
Key Takeaways
1. Neither modality dominates — text wins QA (0.82 vs 0.71), images win high-recall retrieval (93% vs 91% R@20)
2. Hybrid search is the answer — combining text and image retrieval yields 95% R@20, ahead of either single modality (93% image, 91% text)
3. Cohere Embed v4 is the best single model — 58% R@1 with native multimodal embeddings, 12pp ahead of the best open-source option
4. Retrieval depth matters enormously — going from top-1 to top-20 adds 45-50 percentage points across all methods
5. OCR is lossy — tables, figures, equations, and multi-column layouts degrade significantly through text extraction
6. VLMs still lag for QA — even with perfect retrieval, vision-language models produce less accurate answers than text LLMs
Implementation Examples
Cohere Embed v4 (Best: 58% R@1, 97% R@20)
Native multimodal embeddings — text and images share the same embedding space. No separate pipelines needed.
```python
import base64

import cohere
import numpy as np

co = cohere.ClientV2(api_key="YOUR_API_KEY")

# Cohere Embed v4 — best single-model performance on IRPAPERS.
# Natively handles both text and images in one embedding space.

# Embed a text query
query_emb = co.embed(
    texts=["What is the loss function defined in Equation 5?"],
    model="embed-v4.0",
    input_type="search_query",
    embedding_types=["float"],
).embeddings.float_[0]

# Embed document pages (text)
doc_embs = co.embed(
    texts=["Page 1 OCR text...", "Page 2 OCR text..."],
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["float"],
).embeddings.float_

# Embed document pages (images) — same model, same space.
# The API expects images as base64 data URIs, not raw base64 strings.
with open("paper_page_7.png", "rb") as f:
    img_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

img_embs = co.embed(
    images=[f"data:image/png;base64,{img_b64}"],
    model="embed-v4.0",
    input_type="image",
    embedding_types=["float"],
).embeddings.float_

# Compute similarity — text query against both text and image embeddings
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Text similarity: {cosine_sim(query_emb, doc_embs[0]):.3f}")
print(f"Image similarity: {cosine_sim(query_emb, img_embs[0]):.3f}")
```
Text Hybrid Search (Weaviate + Arctic 2.0) (46% R@1)
BM25 + vector hybrid search on OCR transcriptions. The approach used in the IRPAPERS benchmark.
```python
import weaviate
from weaviate.classes.query import MetadataQuery

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="YOUR_CLUSTER_URL",
    auth_credentials=weaviate.auth.AuthApiKey("YOUR_API_KEY"),
)
papers = client.collections.get("ScientificPages")

# Text-based hybrid search (BM25 + vector)
response = papers.query.hybrid(
    query="What learning rate was used for fine-tuning BERT?",
    alpha=0.5,  # Balance between BM25 and vector search
    limit=20,
    return_metadata=MetadataQuery(score=True),
    target_vector="text_embedding",  # Arctic 2.0 embeddings
)

for obj in response.objects:
    print(f"Paper: {obj.properties['paper_title']}")
    print(f"Page: {obj.properties['page_number']}")
    print(f"Score: {obj.metadata.score:.3f}")
    print(f"Text: {obj.properties['ocr_text'][:200]}...")
    print()

client.close()
```
ColQwen2 Image Retrieval (Open-source: 43% R@1, 93% R@20)
Multi-vector vision embeddings from page images. No OCR needed — query directly against rendered pages.
```python
import torch
from colpali_engine.models import ColQwen2, ColQwen2Processor
from PIL import Image

# Open-source image embeddings (R@1: 43%, R@20: 93%)
model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0",
    torch_dtype=torch.float16,
    device_map="cuda",
)
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

# Process queries
queries = ["What learning rate was used for fine-tuning?"]
query_inputs = processor.process_queries(queries).to("cuda")
with torch.no_grad():
    query_embeddings = model(**query_inputs)  # Multi-vector per query

# Process page images
images = [Image.open(f"page_{i}.png") for i in range(1, 21)]
image_inputs = processor.process_images(images).to("cuda")
with torch.no_grad():
    image_embeddings = model(**image_inputs)  # Multi-vector per image

# Score via late interaction (MaxSim)
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
ranked = sorted(enumerate(scores[0].tolist()), key=lambda x: -x[1])
for idx, score in ranked[:5]:
    print(f"Page {idx+1}: {score:.3f}")
```
Image Retrieval (Weaviate) (43% R@1)
Image-based vector search using stored ColPali embeddings in Weaviate.
```python
import base64

import weaviate

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="YOUR_CLUSTER_URL",
    auth_credentials=weaviate.auth.AuthApiKey("YOUR_API_KEY"),
)
papers = client.collections.get("ScientificPages")

# Image-based retrieval against stored ColPali/ColQwen embeddings.
# near_text assumes the collection is configured with a vectorizer for
# this target vector; otherwise embed the query yourself and use
# near_vector instead.
response = papers.query.near_text(
    query="architecture diagram with skip connections",
    limit=20,
    target_vector="image_embedding",  # ColPali multi-vector embeddings
)

for obj in response.objects:
    print(f"Paper: {obj.properties['paper_title']}")
    print(f"Page: {obj.properties['page_number']}")
    # Page image available as base64
    img_b64 = obj.properties.get("page_image_b64")
    if img_b64:
        with open(f"page_{obj.properties['page_number']}.png", "wb") as f:
            f.write(base64.b64decode(img_b64))

client.close()
```
Multimodal Hybrid Fusion (49% R@1, 95% R@20)
Reciprocal Rank Fusion to combine text and image retrieval — the winning strategy from the benchmark.
```python
from typing import List, Tuple

def reciprocal_rank_fusion(
    text_results: List[Tuple[str, float]],  # (doc_id, score)
    image_results: List[Tuple[str, float]],
    k: int = 60,  # RRF constant
    text_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """
    Fuse text and image retrieval results using Reciprocal Rank Fusion.
    On IRPAPERS: boosts R@1 from 46%/43% to 49%, R@20 from 91%/93% to 95%.
    """
    scores = {}
    for rank, (doc_id, _) in enumerate(text_results):
        scores[doc_id] = scores.get(doc_id, 0) + text_weight / (k + rank + 1)
    for rank, (doc_id, _) in enumerate(image_results):
        scores[doc_id] = scores.get(doc_id, 0) + (1 - text_weight) / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: -x[1])

# Example: fuse Arctic 2.0 text search with ColPali image search
text_hits = [("page_42", 0.89), ("page_17", 0.84), ("page_103", 0.81)]
image_hits = [("page_17", 0.92), ("page_42", 0.78), ("page_55", 0.75)]
fused = reciprocal_rank_fusion(text_hits, image_hits)
for doc_id, score in fused[:5]:
    print(f"{doc_id}: {score:.4f}")
```
References
- IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering (arXiv 2602.17687, 2026)
- ColPali: Efficient Document Retrieval with Vision Language Models (arXiv 2407.01449)
- Cohere Embed v4: Multimodal Embeddings
- Arctic Embed: Scalable, Efficient, and Accurate Text Embedding Models