Generating text descriptions of audio content.
Caption generation quality evaluated against AudioCaps reference captions.
Leading models on AudioCaps.
| # | Model | r-10 | Year | Source |
|---|---|---|---|---|
| ★ | CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A) | 83.7 | 2022 | paper ↗ |
| 2 | CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A) | 71.9 | 2022 | paper ↗ |
| 3 | CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A) | 35.1 | 2022 | paper ↗ |
| 4 | Audio Flamingo 3 | 0.700 | 2025 | paper ↗ |
| 5 | AudioCaps baseline (TopDown+Align) | 0.369 | 2026 | paper ↗ |
| 6 | EnCLAP-base | 0.300 | 2026 | paper ↗ |
| 7 | Pengi | 0.271 | 2026 | paper ↗ |
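For readers unfamiliar with the retrieval rows, here is a minimal sketch of how a recall-at-rank-k score is commonly computed. It assumes that "r-10" denotes recall at rank 10 for text-to-audio (T->A) retrieval, which the page does not state explicitly. The function name `recall_at_k`, the similarity-matrix layout, and the toy numbers are all hypothetical. They are not taken from the CLAP paper or from any library. The sketch also assumes one correct clip per query, which is a simplification.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 10) -> float:
    """Percentage of queries whose matching item is ranked in the top k.

    Assumption (hypothetical layout): similarity[i][j] scores text query i
    against audio clip j, and the correct clip for query i is at index i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct clip: how many clips score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data: 3 text queries scored against 3 audio clips (made-up values).
toy = [
    [0.9, 0.1, 0.2],  # correct clip ranked first
    [0.8, 0.3, 0.5],  # correct clip ranked third
    [0.2, 0.4, 0.6],  # correct clip ranked first
]
print(recall_at_k(toy, k=1))
print(recall_at_k(toy, k=3))
```

Scores computed this way fall between 0 and 100, so the sketch applies only to rows reported on that scale.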