Codesota · Tasks · Audio Captioning

Audio Captioning.

Generating text descriptions of audio content.

Canonical metric: SPIDEr

§ 02 · Canonical benchmark

The reference dataset.

AudioCaps

Caption quality evaluated against AudioCaps reference captions

Primary metric: SPIDEr
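
For reference, SPIDEr is conventionally defined as the unweighted mean of the CIDEr and SPICE scores computed over the same set of captions. The sketch below only illustrates that combination step. It assumes the two component scores have already been produced by an external captioning-metric toolkit; the function name `spider_score` and the example values are hypothetical and are not taken from this leaderboard.

```python
def spider_score(cider: float, spice: float) -> float:
    """Combine precomputed CIDEr and SPICE scores into SPIDEr.

    Assumption: both inputs are corpus-level scores for the same
    candidate and reference captions, computed elsewhere.
    """
    if cider < 0 or spice < 0:
        raise ValueError("CIDEr and SPICE scores must be non-negative")
    # SPIDEr is the arithmetic mean of the two component metrics.
    return (cider + spice) / 2.0


# Hypothetical component scores, for illustration only.
example = spider_score(cider=0.8, spice=0.2)
print(f"SPIDEr = {example:.3f}")
```

Because CIDEr is not bounded by 1 while SPICE is, the combined value depends heavily on the CIDEr term, which is worth keeping in mind when comparing scores in the table.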

§ 03 · Top 10

Leading models on AudioCaps.

#	Model	r-10	Year	Source
★	CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A)	83.7	2022	paper ↗
2	CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A)	71.9	2022	paper ↗
3	CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A)	35.1	2022	paper ↗
4	Audio Flamingo 3	0.700	2025	paper ↗
5	AudioCaps baseline (TopDown+Align)	0.369	2026	paper ↗
6	EnCLAP-base	0.300	2026	paper ↗
7	Pengi	0.271	2026	paper ↗

§ 04 · All datasets

1 dataset tracked for this task.

§ 05 · Related tasks
