
Image Captioning

Image captioning — generating natural language descriptions of images — was the task that launched the modern vision-language era when Show and Tell (2015) paired CNNs with RNNs. The field progressed through BLIP, BLIP-2, and CoCa, each improving grounding and descriptive richness, until multimodal LLMs effectively subsumed it as a special case of image-text-to-text. COCO Captions and NoCaps remain standard benchmarks, but CIDEr and SPICE scores have largely saturated — the real frontier is dense captioning, generating paragraph-level descriptions that capture spatial relationships, attributes, and background context that brief captions miss. Captioning's importance now lies more in its role as training signal for other vision-language tasks than as a standalone evaluation.

2 datasets · 30 results · Canonical metric: CIDEr

Canonical Benchmark

COCO Captions

330K images with 5 captions each. Standard benchmark for image captioning.
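
To see that five-captions-per-image structure concretely, here is a minimal sketch using the pycocotools API; it assumes the 2017 caption annotation file has been downloaded from cocodataset.org, and the local path is illustrative.

```python
# Minimal sketch: inspect COCO Captions with pycocotools.
# Assumes `pip install pycocotools`; the annotation path is illustrative
# and must point at a file downloaded from https://cocodataset.org.
from pycocotools.coco import COCO

coco = COCO("annotations/captions_val2017.json")

# Pick one image and list its reference captions (typically 5 per image).
img_id = coco.getImgIds()[0]
ann_ids = coco.getAnnIds(imgIds=[img_id])
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])
```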

Primary metric: CIDEr
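
CIDEr scores a candidate caption by TF-IDF-weighted n-gram agreement with the human references, so frequent boilerplate n-grams count for little and distinctive content words dominate. The sketch below is a simplified from-scratch illustration of that idea, not the official scorer: leaderboard numbers come from the coco-caption toolkit's CIDEr-D variant, which adds stemming, a length penalty, and corpus-wide IDF statistics.

```python
# Simplified CIDEr-style scorer: TF-IDF-weighted n-gram cosine similarity,
# averaged over n-gram orders 1..4 and over the reference captions.
import math
from collections import Counter

def _ngrams(tokens, n):
    """Counter of the length-n n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def _tfidf(counts, idf):
    return {g: c * idf.get(g, 0.0) for g, c in counts.items()}

def _cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider(candidate, references, corpus, max_n=4):
    """Score one candidate caption against its references.

    `corpus` is a list of reference-caption lists (one per image) that
    supplies document frequencies; captions are lowercased and
    whitespace-tokenized.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        # Document frequency: how many images' reference sets contain
        # each n-gram at least once.
        df = Counter()
        for refs in corpus:
            grams = set()
            for ref in refs:
                grams |= set(_ngrams(ref.lower().split(), n))
            df.update(grams)
        idf = {g: math.log(len(corpus) / d) for g, d in df.items()}

        cand = _tfidf(_ngrams(candidate.lower().split(), n), idf)
        sims = [_cosine(cand, _tfidf(_ngrams(r.lower().split(), n), idf))
                for r in references]
        total += sum(sims) / len(sims)
    return 10.0 * total / max_n  # scaled by 10, as in the CIDEr paper

# Toy usage: two images' reference sets double as the IDF corpus.
corpus = [
    ["a brown dog runs across the grass", "a dog running on a lawn"],
    ["two cats sleeping on a couch", "a pair of cats nap on the sofa"],
]
print(round(cider("a dog runs on the grass", corpus[0], corpus), 3))
```

When evaluating a full model, the per-image score above is simply averaged across the test set.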

Top 10

Leading models on COCO Captions.

Rank  Model              CIDEr  Year  Source
1     PaLI-X-55B         149    2023  paper
2     PaLI-17B           149    2022  paper
3     BEiT-3             148    2022  paper
4     BLIP-2 (OPT 2.7B)  146    2023  paper
5     BLIP-2             146    2023  paper
6     OFA                145    2022  paper
7     GIT2               145    2022  paper
8     GIT                145    2022  paper
9     CoCa               144    2022  paper
10    CoCa               144    2022  paper

All datasets

2 datasets tracked for this task.

Related tasks

Other tasks in Multimodal.

Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.

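
As a quick start, the sketch below runs captioning locally with the transformers image-to-text pipeline; the BLIP checkpoint id is one public example (the BLIP-2 checkpoints from the table work the same way but download far more weights), and the URL points at a COCO validation image.

```python
# Minimal sketch: caption one image with the transformers pipeline.
# Assumes `pip install transformers pillow`; the checkpoint and image
# URL are examples, and any image-to-text checkpoint on the Hub works.
from transformers import pipeline

captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")
out = captioner("http://images.cocodataset.org/val2017/000000039769.jpg")
print(out[0]["generated_text"])  # a short one-sentence caption
```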