Image Captioning
Image captioning, the task of generating natural-language descriptions of images, launched the modern vision-language era when Show and Tell (2015) paired a CNN encoder with an RNN decoder. The field progressed through BLIP, BLIP-2, and CoCa, each improving grounding and descriptive richness, until multimodal LLMs effectively subsumed it as a special case of image-text-to-text generation. COCO Captions and NoCaps remain the standard benchmarks, but CIDEr and SPICE scores have largely saturated; the active frontier is dense and paragraph-level captioning, which describes the spatial relationships, attributes, and background context that a single brief caption misses. Captioning now matters more as a training signal for other vision-language tasks than as a standalone evaluation.
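For reference, the sketch below shows how CIDEr is typically computed with the pycocoevalcap package (the reference COCO caption evaluation code). The image ids and captions are invented examples, and the usual PTB tokenization preprocessing step is skipped for brevity.

```python
# Minimal sketch: corpus-level CIDEr with pycocoevalcap.
# Ids and captions are made-up examples; real evaluation runs PTBTokenizer first.
from pycocoevalcap.cider.cider import Cider

references = {  # each image id maps to its human reference captions
    "img1": ["a cat sitting on a couch", "a small cat rests on a sofa"],
    "img2": ["two dogs playing in the grass", "dogs run across a green lawn"],
}
candidates = {  # each image id maps to a single model-generated caption
    "img1": ["a cat lying on a couch"],
    "img2": ["two dogs playing outside"],
}

corpus_score, per_image_scores = Cider().compute_score(references, candidates)
print(f"CIDEr: {corpus_score:.3f}")
```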
COCO Captions
About 330K images, each paired with five human-written reference captions. The standard benchmark for image captioning.
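A minimal sketch of browsing the caption annotations with the pycocotools COCO API, assuming the 2017 annotation files have already been downloaded and unpacked locally (the path below is an assumption):

```python
# Minimal sketch: read COCO caption annotations with pycocotools.
from pycocotools.coco import COCO

coco_caps = COCO("annotations/captions_val2017.json")  # local path is an assumption

img_id = coco_caps.getImgIds()[0]              # pick an arbitrary image
ann_ids = coco_caps.getAnnIds(imgIds=img_id)   # caption annotation ids for that image
for ann in coco_caps.loadAnns(ann_ids):        # typically five captions per image
    print(ann["caption"])
```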
Top 10
Leading models on COCO Captions.
All datasets
2 datasets tracked for this task.
Related tasks
Other tasks in Multimodal.
Looking to run a model? Hugging Face hosts inference for this task type.
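As a quick-start sketch, the Hugging Face transformers "image-to-text" pipeline can caption an image with an off-the-shelf checkpoint; the model name and image URL below are illustrative choices, not recommendations from this page.

```python
# Minimal sketch: off-the-shelf captioning via the transformers image-to-text pipeline.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
result = captioner("http://images.cocodataset.org/val2017/000000039769.jpg")
print(result[0]["generated_text"])  # e.g. a short caption describing the image
```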