
Image Captioning

Image captioning — generating natural language descriptions of images — was the task that launched the modern vision-language era when Show and Tell (2015) paired CNNs with RNNs. The field progressed through BLIP, BLIP-2, and CoCa, each improving grounding and descriptive richness, until multimodal LLMs effectively subsumed it as a special case of image-text-to-text. COCO Captions and NoCaps remain standard benchmarks, but CIDEr and SPICE scores have largely saturated — the real frontier is dense captioning, generating paragraph-level descriptions that capture spatial relationships, attributes, and background context that brief captions miss. Captioning's importance now lies more in its role as training signal for other vision-language tasks than as a standalone evaluation.

2 datasets · 30 results · Canonical metric: CIDEr

Canonical Benchmark

COCO Captions

330K images with 5 captions each. Standard benchmark for image captioning.
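
To see that five-captions-per-image structure concretely, here is a minimal sketch using the pycocotools API; it assumes the 2017 caption annotation file has been downloaded from cocodataset.org, and the local path is illustrative.

```python
# Minimal sketch: inspect COCO Captions with pycocotools.
# Assumes `pip install pycocotools`; the annotation path is illustrative
# and must point at a file downloaded from https://cocodataset.org.
from pycocotools.coco import COCO

coco = COCO("annotations/captions_val2017.json")

# Pick one image and list its reference captions (typically 5 per image).
img_id = coco.getImgIds()[0]
ann_ids = coco.getAnnIds(imgIds=[img_id])
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])
```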

Primary metric: CIDEr
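
CIDEr scores a candidate caption by TF-IDF-weighted n-gram agreement with the human references, so frequent boilerplate n-grams count for little and distinctive content words dominate. The sketch below is a simplified from-scratch illustration of that idea, not the official scorer: leaderboard numbers come from the coco-caption toolkit's CIDEr-D variant, which adds stemming, a length penalty, and corpus-wide IDF statistics.

```python
# Simplified CIDEr-style scorer: TF-IDF-weighted n-gram cosine similarity,
# averaged over n-gram orders 1..4 and over the reference captions.
import math
from collections import Counter

def _ngrams(tokens, n):
    """Counter of the length-n n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def _tfidf(counts, idf):
    return {g: c * idf.get(g, 0.0) for g, c in counts.items()}

def _cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider(candidate, references, corpus, max_n=4):
    """Score one candidate caption against its references.

    `corpus` is a list of reference-caption lists (one per image) that
    supplies document frequencies; captions are lowercased and
    whitespace-tokenized.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        # Document frequency: how many images' reference sets contain
        # each n-gram at least once.
        df = Counter()
        for refs in corpus:
            grams = set()
            for ref in refs:
                grams |= set(_ngrams(ref.lower().split(), n))
            df.update(grams)
        idf = {g: math.log(len(corpus) / d) for g, d in df.items()}

        cand = _tfidf(_ngrams(candidate.lower().split(), n), idf)
        sims = [_cosine(cand, _tfidf(_ngrams(r.lower().split(), n), idf))
                for r in references]
        total += sum(sims) / len(sims)
    return 10.0 * total / max_n  # scaled by 10, as in the CIDEr paper

# Toy usage: two images' reference sets double as the IDF corpus.
corpus = [
    ["a brown dog runs across the grass", "a dog running on a lawn"],
    ["two cats sleeping on a couch", "a pair of cats nap on the sofa"],
]
print(round(cider("a dog runs on the grass", corpus[0], corpus), 3))
```

When evaluating a full model, the per-image score above is simply averaged across the test set.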

Top 10

Leading models on COCO Captions.

Rank  Model              CIDEr  Year  Source
1     PaLI-X-55B         149    2023  paper
2     PaLI-17B           149    2022  paper
3     BEiT-3             148    2022  paper
4     BLIP-2 (OPT 2.7B)  146    2023  paper
5     BLIP-2             146    2023  paper
6     OFA                145    2022  paper
7     GIT2               145    2022  paper
8     GIT                145    2022  paper
9     CoCa               144    2022  paper
10    CoCa               144    2022  paper

All datasets

2 datasets tracked for this task.

Related tasks

Other tasks in Multimodal.

Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.

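
As a quick start, the sketch below runs captioning locally with the transformers image-to-text pipeline; the BLIP checkpoint id is one public example (the BLIP-2 checkpoints from the table work the same way but download far more weights), and the URL points at a COCO validation image.

```python
# Minimal sketch: caption one image with the transformers pipeline.
# Assumes `pip install transformers pillow`; the checkpoint and image
# URL are examples, and any image-to-text checkpoint on the Hub works.
from transformers import pipeline

captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")
out = captioner("http://images.cocodataset.org/val2017/000000039769.jpg")
print(out[0]["generated_text"])  # a short one-sentence caption
```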