
Multimodal · visual-document-retrieval

Cross-Modal Retrieval.

Cross-modal retrieval finds the best match between items in different modalities: given text, find the right image; given an image, find the right caption. CLIP (2021) reshaped the field by learning a shared embedding space from 400M image-text pairs, and it spawned an ecosystem of models such as SigLIP, EVA-CLIP, and OpenCLIP that power everything from search engines to generative model guidance. The challenge has since shifted from coarse retrieval to fine-grained discrimination: telling apart nearly identical images based on subtle textual differences, or retrieving across underrepresented domains and languages. Recall@K on Flickr30K and COCO may look saturated, but real-world deployment exposes failures on long-tail queries and compositional descriptions.
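To make the shared-embedding idea and the Recall@K metric concrete, here is a minimal sketch in plain Python. It assumes the text and image embeddings have already been produced by some encoder; the toy 2-D vectors, the `gold` index list, and the helper names `cosine` and `recall_at_k` are illustrative assumptions, not part of any library or benchmark harness.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def recall_at_k(query_embs, item_embs, gold, k):
    """Fraction of queries whose gold item appears in the top-k ranked items.

    gold[i] is the index in item_embs of the correct match for query i
    (hypothetical one-to-one labelling, chosen for simplicity).
    """
    hits = 0
    for query, target in zip(query_embs, gold):
        # Rank every candidate item by similarity to the query, best first.
        ranked = sorted(
            range(len(item_embs)),
            key=lambda i: cosine(query, item_embs[i]),
            reverse=True,
        )
        if target in ranked[:k]:
            hits += 1
    return hits / len(query_embs)

# Toy embeddings standing in for encoder outputs (assumed values).
text_embs = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.5]]
image_embs = [[0.9, 0.2], [0.6, 0.8], [0.0, 1.0]]
gold = [0, 2, 1]  # correct image index for each text query

print(recall_at_k(text_embs, image_embs, gold, k=1))
print(recall_at_k(text_embs, image_embs, gold, k=2))
```

In this toy setup the third query ranks a near-duplicate image above its gold match, so it counts as a miss at K=1 and a hit at K=2. This is a small-scale version of the fine-grained failure described above. Real evaluations on Flickr30K or COCO typically allow several valid captions per image, which this one-to-one sketch does not model.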

Datasets

Results

nDCG@5

Canonical metric

§ 02 · Canonical benchmark