Benchmark Deep Dive

IRPAPERS: Text vs Image Retrieval on Scientific Documents

A new benchmark of 3,230 pages from 166 papers reveals that text and image retrieval have complementary failures — and combining them pushes Recall@20 to 95%.

March 2026 | 12 min read | arXiv 2602.17687

TL;DR

  • Cohere Embed v4 leads retrieval at 58% Recall@1, 97% Recall@20
  • Multimodal hybrid search outperforms either modality alone (49% R@1, 95% R@20)
  • Text RAG beats image RAG on QA alignment: 0.82 vs 0.71
  • Complementary failures — text loses tables/figures, images lose dense prose
  • Dataset: 3,230 pages, 166 papers, 180 needle-in-the-haystack questions

What is IRPAPERS?

Most RAG benchmarks assume text-only input. But scientific papers contain tables, equations, figures, and complex layouts that OCR often mangles. IRPAPERS is the first benchmark designed to systematically compare image-based vs text-based retrieval and QA on real scientific documents.

  • 3,230 pages indexed (both image and OCR versions)
  • 166 scientific papers (diverse ML/AI topics)
  • 180 evaluation questions (needle-in-the-haystack design)

Created by the Weaviate team (Connor Shorten, Bob van Luijt, et al.), the benchmark provides dual representations of every page — a document image and an OCR transcription — enabling controlled experiments across modalities.

How It Works: Two Pipelines, One Question

The same question is answered through two parallel pipelines. Each page exists as both an image and a text transcription. The benchmark measures which representation retrieves the right page and produces the right answer.

[Figure: Text vs Image Retrieval Pipelines. The text pipeline runs the 3,230 PDF pages through OCR extraction, embeds the transcriptions with Arctic 2.0 + BM25, retrieves (R@1: 46%), and answers with an LLM (0.82 alignment). The image pipeline embeds the page images directly with ColPali / ColQwen vision embeddings, retrieves (R@1: 43%), and answers with a VLM (0.71 alignment). Hybrid fusion of the two reaches R@1 49% and R@20 95%.]

Retrieval Performance

Recall at different cutoffs across retrieval methods. Cohere Embed v4 dominates, but the open-source multimodal hybrid is competitive — and significantly outperforms either single modality.

Retrieval Performance (%) on IRPAPERS — higher is better

Method                                   Recall@1  Recall@5  Recall@20
Cohere Embed v4 (closed-source)             58%       87%       97%
Multimodal Hybrid (text + image fusion)     49%       81%       95%
Text (Arctic 2.0 + BM25, open-source)       46%       78%       91%
Image (ColPali-style, open-source)          43%       78%       93%

Key insight: Image retrieval matches text at Recall@5 (both 78%) and surpasses it at Recall@20 (93% vs 91%). The modalities are genuinely complementary — combining them yields the best results at every cutoff.

Embedding Models Compared

Five embedding approaches tested on the IRPAPERS corpus, spanning closed-source multimodal, open-source text, and open-source vision models.

Model              Type           Modality    R@1   R@20  Notes
Cohere Embed v4    Closed-source  Multimodal  58%   97%   Best overall; native multimodal embeddings.
Arctic 2.0         Open-source    Text        46%   91%   Best open-source text embeddings; used with BM25 hybrid.
Voyage 3 Large     Closed-source  Text        44%   90%   Strong text retrieval, competitive with Arctic.
ColPali / ColQwen  Open-source    Image       43%   93%   Multi-vector image embeddings; strong at high recall.
MUVERA             Open-source    Image       39%   88%   Single-vector approximation of multi-vector; faster inference.

MUVERA offers a useful efficiency tradeoff: it approximates multi-vector ColPali embeddings with a single vector, trading ~4 percentage points of Recall@1 for significantly faster inference.
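To see what that tradeoff looks like mechanically, here is a toy NumPy sketch contrasting late-interaction MaxSim scoring with a naive single-vector approximation via mean pooling. The pooling here is only an illustration of the single-vector idea, not MUVERA's actual fixed-dimensional encoding, and all vectors are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-vector embeddings: 8 query-token vectors, 32 page-patch vectors, dim 16
query_vecs = rng.standard_normal((8, 16))
page_vecs = rng.standard_normal((32, 16))

def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take its best-matching
    document vector, then sum over query tokens (ColPali-style)."""
    sims = query_vecs @ doc_vecs.T          # (n_query, n_doc) token similarities
    return float(sims.max(axis=1).sum())

def pooled_score(query_vecs, doc_vecs):
    """Single-vector approximation: mean-pool each side to one vector,
    then score with a single dot product (cheap, but loses token-level detail)."""
    return float(np.mean(query_vecs, axis=0) @ np.mean(doc_vecs, axis=0))

print(f"MaxSim (multi-vector):  {maxsim(query_vecs, page_vecs):.3f}")
print(f"Pooled (single-vector): {pooled_score(query_vecs, page_vecs):.3f}")
```

MaxSim preserves per-token matching (one strong table-cell match can dominate the score), which is why multi-vector models shine at layout-heavy pages; the pooled score averages that signal away.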

Question Answering Performance

Beyond retrieval, the benchmark measures end-to-end QA quality by comparing generated answers against ground truth. Text-based RAG leads here, with a meaningful 15.5% relative advantage.

QA Ground-Truth Alignment — higher is better

  • Text-based RAG: 0.82 (OCR transcription + text embeddings + LLM)
  • Image-based RAG: 0.71 (page images + vision embeddings + VLM)

Text RAG leads by +15.5% on ground-truth alignment.

Why text leads QA: LLMs are better at reasoning over text than vision models are at reasoning over images. Even when image retrieval finds the right page, the VLM often struggles to extract the precise answer from the visual layout.

Complementary Failures

The paper's most important finding: text and image retrieval fail on different questions. This is why hybrid search works — each modality covers the other's blind spots.

  • Text fails: tables, figures, equations, diagrams (OCR loses spatial layout and flattens structure)
  • Text fails: multi-column layouts (text extraction merges columns incorrectly)
  • Image fails: dense prose paragraphs (vision models struggle with long text spans)
  • Image fails: small fonts and footnotes (resolution limitations at page level)
  • Both fail: cross-page references (neither modality links content across pages)
  • Both fail: implicit knowledge (requires domain understanding beyond the page)

Which Modality Wins by Question Type?

Different question types favor different retrieval strategies. Understanding this mapping helps you choose the right approach for your use case.

Question Type           Example                                                        Best Modality  Why
Factual Extraction      What learning rate was used for fine-tuning?                   Text           OCR captures exact numbers reliably
Table/Figure Reading    Which model achieved the highest F1 in Table 3?                Image          Layout structure preserved in images
Cross-reference         What does the ablation in Section 4.2 show about component X?  Hybrid         Requires both text understanding and layout context
Equation Understanding  What is the loss function defined in Equation 5?               Image          LaTeX rendering lost in OCR transcription
Methodology Details     How many layers does the encoder have?                         Text           Buried in prose, where text search excels
Visual Comparison       Which architecture diagram shows skip connections?             Image          Diagrams are purely visual content
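The mapping above can be sketched as a simple routing heuristic. This keyword-based `route_modality` function is hypothetical (it is not part of the benchmark), and a production system would use a trained classifier, but it illustrates how the question-type rules could drive retrieval.

```python
# Hypothetical keyword router sketching the question-type -> modality mapping.
# A real system would use a classifier; this is only an illustration.
TABLE_FIGURE_CUES = ("table", "figure", "fig.", "equation", "diagram", "plot", "chart")
CROSS_REF_CUES = ("section", "ablation", "appendix")

def route_modality(question: str) -> str:
    q = question.lower()
    if any(cue in q for cue in CROSS_REF_CUES):
        return "hybrid"   # needs text understanding plus layout context
    if any(cue in q for cue in TABLE_FIGURE_CUES):
        return "image"    # layout / visual structure preserved in page images
    return "text"         # factual extraction and methodology details live in prose

print(route_modality("Which model achieved the highest F1 in Table 3?"))  # image
print(route_modality("What learning rate was used for fine-tuning?"))     # text
```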

Practical Implications for RAG Systems

If you're building document RAG

Don't rely on text alone. Scientific documents lose roughly 13% of retrievable information through OCR. Store both page images and text, and fuse retrieval results: the +4pp hybrid boost at R@20 (95% vs 91% for text alone) compounds across large corpora.

If you need precision (R@1)

Use Cohere Embed v4 if budget allows — it's 12 percentage points ahead of the best open-source option at R@1 (58% vs 46%). For open-source, multimodal hybrid narrows the gap to 49%.

If you need recall (R@20)

Image retrieval actually beats text at R@20 (93% vs 91%). ColPali/ColQwen embeddings capture layout information that text misses. For reranking pipelines, start with image retrieval for the candidate set.
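A minimal sketch of that recall-first strategy: build a wide candidate set with image retrieval, then reorder it with text scores. The page ids and score dict below are hypothetical stand-ins for real retriever outputs.

```python
# Two-stage retrieval sketch: image retrieval for recall, text scores for precision.
def recall_then_rerank(image_ranked, text_scores, candidates=20, final=5):
    """image_ranked: page ids ordered by image-retrieval score (best first).
    text_scores: page id -> text-retrieval score used for reranking."""
    pool = image_ranked[:candidates]                          # high-recall candidate set
    return sorted(pool, key=lambda pid: -text_scores.get(pid, 0.0))[:final]

image_ranked = ["p7", "p42", "p3", "p19", "p55"]
text_scores = {"p42": 0.91, "p19": 0.84, "p7": 0.40}
print(recall_then_rerank(image_ranked, text_scores, candidates=5, final=3))
# ['p42', 'p19', 'p7']
```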

If you need answer quality

Text-based RAG still produces better answers (0.82 vs 0.71 alignment). Even with perfect image retrieval, current VLMs lag behind text LLMs at extracting precise information from visual layouts.

The Retrieval Depth Effect

One of the paper's clearest findings: more retrieval context helps both modalities significantly. Going from Recall@1 to Recall@20, text improves by 45 percentage points and images by 50.

Text: +45pp from R@1 to R@20
46% → 78% → 91%. BM25 keyword matching rescues queries that embedding similarity misses.
Image: +50pp from R@1 to R@20
43% → 78% → 93%. Multi-vector embeddings (ColPali) have high variance per-query — more candidates smooth this out.
Hybrid: +46pp from R@1 to R@20
49% → 81% → 95%. Diminishing returns are slower for hybrid — each additional result is more likely to be genuinely relevant.
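For reference, the Recall@k metric behind these numbers is straightforward to compute in the needle-in-the-haystack setting, where each question targets exactly one gold page. A minimal sketch with toy data:

```python
def recall_at_k(ranked_ids, gold_id, k):
    """1 if the single gold page appears in the top-k results, else 0."""
    return int(gold_id in ranked_ids[:k])

def mean_recall_at_k(results, k):
    """results: list of (ranked page ids, gold page id), one pair per question."""
    return sum(recall_at_k(r, g, k) for r, g in results) / len(results)

# Two toy questions: gold page ranked 1st and 3rd respectively
results = [(["p9", "p2", "p5"], "p9"), (["p1", "p4", "p7"], "p7")]
print(mean_recall_at_k(results, 1))   # 0.5
print(mean_recall_at_k(results, 5))   # 1.0
```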

About the Dataset

Needle-in-the-Haystack Design
Each of the 180 questions targets a specific page in a specific paper. The system must retrieve the exact right page from 3,230 candidates — a realistic scenario for research assistants.
Dual Representation
Every page exists as both a rendered image (as you'd see in a PDF viewer) and an OCR text transcription. This controlled setup isolates the effect of modality from other variables.
Open Access
The IRPAPERS dataset and all experimental code are publicly available under CC BY 4.0. Licensed for research and commercial use.

Key Takeaways

  1. Neither modality dominates — text wins QA (0.82 vs 0.71), images win high-recall retrieval (93% vs 91% R@20)
  2. Hybrid search is the answer — combining text and image retrieval yields 95% R@20, +2pp over the best single modality (image, 93%)
  3. Cohere Embed v4 is the best single model — 58% R@1 with native multimodal embeddings, 12pp ahead of the best open-source option
  4. Retrieval depth matters enormously — going from top-1 to top-20 adds 45-50 percentage points across all methods
  5. OCR is lossy — tables, figures, equations, and multi-column layouts degrade significantly through text extraction
  6. VLMs still lag for QA — even with perfect retrieval, vision-language models produce less accurate answers than text LLMs

Implementation Examples

Cohere Embed v4 (Best: 58% R@1, 97% R@20)

Native multimodal embeddings — text and images share the same embedding space. No separate pipelines needed.

import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")

# Cohere Embed v4 — best single-model performance on IRPAPERS
# Natively handles both text and images in one embedding space

# Embed a text query
query_emb = co.embed(
    texts=["What is the loss function defined in Equation 5?"],
    model="embed-v4.0",
    input_type="search_query",
    embedding_types=["float"],
).embeddings.float_[0]

# Embed document pages (text)
doc_embs = co.embed(
    texts=["Page 1 OCR text...", "Page 2 OCR text..."],
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["float"],
).embeddings.float_

# Embed document pages (images) — same model, same space
# The v2 API expects images as base64 data URIs, not raw base64 strings
import base64
with open("paper_page_7.png", "rb") as f:
    img_b64 = base64.standard_b64encode(f.read()).decode("utf-8")
img_data_uri = f"data:image/png;base64,{img_b64}"

img_embs = co.embed(
    images=[img_data_uri],
    model="embed-v4.0",
    input_type="image",
    embedding_types=["float"],
).embeddings.float_

# Compute similarity — text query against both text and image embeddings
import numpy as np
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Text similarity: {cosine_sim(query_emb, doc_embs[0]):.3f}")
print(f"Image similarity: {cosine_sim(query_emb, img_embs[0]):.3f}")

Text Hybrid Search (Weaviate + Arctic 2.0) (46% R@1)

BM25 + vector hybrid search on OCR transcriptions. The approach used in the IRPAPERS benchmark.

import weaviate
from weaviate.classes.query import MetadataQuery

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="YOUR_CLUSTER_URL",
    auth_credentials=weaviate.auth.AuthApiKey("YOUR_API_KEY"),
)

papers = client.collections.get("ScientificPages")

# Text-based hybrid search (BM25 + vector)
response = papers.query.hybrid(
    query="What learning rate was used for fine-tuning BERT?",
    alpha=0.5,  # Balance between BM25 and vector search
    limit=20,
    return_metadata=MetadataQuery(score=True),
    target_vector="text_embedding",  # Arctic 2.0 embeddings
)

for obj in response.objects:
    print(f"Paper: {obj.properties['paper_title']}")
    print(f"Page: {obj.properties['page_number']}")
    print(f"Score: {obj.metadata.score:.3f}")
    print(f"Text: {obj.properties['ocr_text'][:200]}...")
    print()

ColQwen2 Image Retrieval (Open-source: 43% R@1, 93% R@20)

Multi-vector vision embeddings from page images. No OCR needed — query directly against rendered pages.

from colpali_engine.models import ColQwen2, ColQwen2Processor
from PIL import Image
import torch

# Open-source image embeddings (R@1: 43%, R@20: 93%)
model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0",
    torch_dtype=torch.float16,
    device_map="cuda",
)
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

# Process queries
queries = ["What learning rate was used for fine-tuning?"]
query_inputs = processor.process_queries(queries).to("cuda")

with torch.no_grad():
    query_embeddings = model(**query_inputs)  # Multi-vector per query

# Process page images
images = [Image.open(f"page_{i}.png") for i in range(1, 21)]
image_inputs = processor.process_images(images).to("cuda")

with torch.no_grad():
    image_embeddings = model(**image_inputs)  # Multi-vector per image

# Score via late interaction (MaxSim)
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
ranked = sorted(enumerate(scores[0].tolist()), key=lambda x: -x[1])

for idx, score in ranked[:5]:
    print(f"Page {idx+1}: {score:.3f}")

Image Retrieval (Weaviate) (43% R@1)

Image-based vector search using stored ColPali embeddings in Weaviate.

import weaviate
import base64

client = weaviate.connect_to_weaviate_cloud(
    cluster_url="YOUR_CLUSTER_URL",
    auth_credentials=weaviate.auth.AuthApiKey("YOUR_API_KEY"),
)

papers = client.collections.get("ScientificPages")

# Image-based retrieval against stored ColPali/ColQwen page embeddings.
# Note: near_text requires a multimodal vectorizer configured for this
# target vector so the text query can be embedded into the same space.
response = papers.query.near_text(
    query="architecture diagram with skip connections",
    limit=20,
    target_vector="image_embedding",  # ColPali multi-vector embeddings
)

for obj in response.objects:
    print(f"Paper: {obj.properties['paper_title']}")
    print(f"Page: {obj.properties['page_number']}")
    # Page image available as base64
    img_b64 = obj.properties.get('page_image_b64')
    if img_b64:
        with open(f"page_{obj.properties['page_number']}.png", "wb") as f:
            f.write(base64.b64decode(img_b64))

Multimodal Hybrid Fusion (49% R@1, 95% R@20)

Reciprocal Rank Fusion to combine text and image retrieval — the winning strategy from the benchmark.

from typing import List, Tuple

def reciprocal_rank_fusion(
    text_results: List[Tuple[str, float]],  # (doc_id, score)
    image_results: List[Tuple[str, float]],
    k: int = 60,  # RRF constant
    text_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """
    Fuse text and image retrieval results using Reciprocal Rank Fusion.
    On IRPAPERS: boosts R@1 from 46%/43% to 49%, R@20 from 91%/93% to 95%.
    """
    scores = {}

    for rank, (doc_id, _) in enumerate(text_results):
        scores[doc_id] = scores.get(doc_id, 0) + text_weight / (k + rank + 1)

    for rank, (doc_id, _) in enumerate(image_results):
        scores[doc_id] = scores.get(doc_id, 0) + (1 - text_weight) / (k + rank + 1)

    return sorted(scores.items(), key=lambda x: -x[1])


# Example: fuse Arctic 2.0 text search with ColPali image search
text_hits = [("page_42", 0.89), ("page_17", 0.84), ("page_103", 0.81)]
image_hits = [("page_17", 0.92), ("page_42", 0.78), ("page_55", 0.75)]

fused = reciprocal_rank_fusion(text_hits, image_hits)
for doc_id, score in fused[:5]:
    print(f"{doc_id}: {score:.4f}")
