Cross-modal retrieval finds the best match between items in different modalities — given text, find the right image; given an image, find the right caption. CLIP (2021) revolutionized the field by learning a shared embedding space from 400M image-text pairs, spawning an entire ecosystem of models like SigLIP, EVA-CLIP, and OpenCLIP that power everything from search engines to generative model guidance. The challenge has shifted from coarse retrieval to fine-grained discrimination: telling apart nearly identical images based on subtle textual differences, or retrieving across underrepresented domains and languages. Recall@K on Flickr30K and COCO may look saturated, but real-world deployment exposes failures on long-tail queries and compositional descriptions.
Visual document retrieval benchmark for page-level document search
Leading models on ViDoRe.
No results yet. Be the first to contribute.
Didn't find the model, metric, or dataset you needed? Tell us in one line. We read every message and reply within 48 hours.
Still looking for something on Cross-Modal Retrieval? A missing model, a stale score, a benchmark we should cover — drop it here and we'll handle it.
Real humans read every message. We track what people are asking for and prioritize accordingly.