
Audio Captioning.

Generating text descriptions of audio content.

Datasets: 1 · Results: 7 · Canonical metric: spider
§ 02 · Canonical benchmark

The reference dataset.

AudioCaps

Caption generation quality, evaluated against AudioCaps reference captions

Primary metric: spider
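
For readers unfamiliar with the metric: SPIDEr is conventionally defined as the arithmetic mean of a CIDEr score (CIDEr-D in most audio captioning toolkits) and a SPICE score. The sketch below only illustrates that final averaging step. The function name `spider` and the input values are hypothetical and are not taken from this leaderboard; computing CIDEr-D and SPICE themselves needs a full evaluation toolkit, which is assumed to have been run already.

```python
def spider(cider_d: float, spice: float) -> float:
    """Combine precomputed CIDEr-D and SPICE scores into SPIDEr (their mean)."""
    return (cider_d + spice) / 2.0


# Hypothetical component scores, for illustration only (not leaderboard data).
score = spider(cider_d=0.60, spice=0.14)
print(f"SPIDEr = {score:.3f}")
```

Because CIDEr-D and SPICE are on different scales, the average tends to be dominated by the CIDEr-D term, which is worth keeping in mind when comparing scores.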
§ 03 · Top 10

Leading models.

Leading models on AudioCaps.

| # | Model | r-10 | Year | Source |
|---|-------|------|------|--------|
| 1 | CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A) | 83.7 | 2022 | paper |
| 2 | CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A) | 71.9 | 2022 | paper |
| 3 | CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A) | 35.1 | 2022 | paper |
| 4 | Audio Flamingo 3 | 0.700 | 2025 | paper |
| 5 | AudioCaps baseline (TopDown+Align) | 0.369 | 2026 | paper |
| 6 | EnCLAP-base | 0.300 | 2026 | paper |
| 7 | Pengi | 0.271 | 2026 | paper |


§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

AudioCaps (canonical) · 7 results · spider
Top: CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A), 83.7
§ 05 · Related tasks

Other tasks in Audio.

Audio-to-Audio · Music Generation · Sound Event Detection · Text-to-Audio · Voice Activity Detection