
Cross-Modal Retrieval

Cross-modal retrieval finds the best match between items in different modalities — given text, find the right image; given an image, find the right caption. CLIP (2021) revolutionized the field by learning a shared embedding space from 400M image-text pairs, spawning an entire ecosystem of models like SigLIP, EVA-CLIP, and OpenCLIP that power everything from search engines to generative model guidance. The challenge has shifted from coarse retrieval to fine-grained discrimination: telling apart nearly identical images based on subtle textual differences, or retrieving across underrepresented domains and languages. Recall@K on Flickr30K and COCO may look saturated, but real-world deployment exposes failures on long-tail queries and compositional descriptions.
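The shared-embedding approach reduces retrieval to nearest-neighbor search: embed queries and candidates with the same model, then rank by cosine similarity. Below is a minimal sketch using the Hugging Face transformers CLIP API with a public checkpoint; the image file names are hypothetical placeholders, and any CLIP-family checkpoint can be swapped in.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint (any CLIP-family model works the same way).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate images and a text query.
images = [Image.open(p).convert("RGB") for p in ["cat.jpg", "dog.jpg", "car.jpg"]]
query = "a photo of a dog"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_texts, num_images): scaled cosine similarities
# between each text and each image in the shared embedding space.
scores = outputs.logits_per_text[0]
ranking = scores.argsort(descending=True)
print("best match index:", ranking[0].item(), "scores:", scores.tolist())
```

The same pattern runs in the other direction (image query, text candidates) by ranking along logits_per_image instead; in production the candidate embeddings are precomputed and indexed for approximate nearest-neighbor search.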

Canonical Benchmark

ViDoRe

Visual document retrieval benchmark for page-level document search

Primary metric: nDCG@5
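nDCG@5 scores the top five retrieved pages by relevance, discounts each gain by its rank, and normalizes against the ideal ordering, so a perfect ranking scores 1.0. A minimal sketch of the linear-gain formulation follows; some evaluations use the exponential-gain variant (2^rel − 1), which is identical for the binary relevance labels typical of page-level retrieval.

```python
import math

def dcg_at_k(rels, k):
    # Gain of each result discounted by log2(rank + 1), with 1-indexed ranks.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k=5):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Relevance of retrieved pages in ranked order (1 = relevant, 0 = not).
print(ndcg_at_k([0, 1, 0, 0, 1]))  # ≈ 0.62
```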

Top 10

Leading models on ViDoRe.

No results yet.

All datasets

1 dataset tracked for this task.


Run Inference

Looking to run a model? Hugging Face hosts inference for this task type.
