Audio Understanding Benchmarks
How audio AI evaluation evolved from environmental sound classification on small datasets through large-scale event detection to foundation-model-era benchmarks that combine audio perception with language understanding. The lineage runs from ESC-50 (2015) through AudioSet (2017) to audio-text retrieval and captioning benchmarks (Clotho, AudioCaps — popularised by the CLAP model), then to VoiceBench and AudioBench, which test audio-language model instruction following. Branches include MUSDB18 (music source separation) and MusicNet (note-level music transcription).
ESC-50 saturated in the late 2010s and serves as a sanity check today, not a frontier benchmark. AudioSet remains the backbone training resource for audio classification, and strong models now exceed 0.50 mAP on its eval set. The inflection point was the CLAP *model* (Contrastive Language-Audio Pretraining, 2022): its release formalised zero-shot audio-text retrieval as the standard evaluation on AudioCaps and Clotho, and the interesting question shifted from 'classify this sound' to 'describe this audio' and 'follow instructions about this audio'. VoiceBench and AudioBench (2024) represent the current frontier: instruction-following evaluation for audio-language models, where the task is as open-ended as text LLM evaluation but grounded in audio input. No single benchmark has consolidated community attention the way LibriSpeech did for ASR.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Click any node to jump to its detail.
Nodes in detail.
ESC-50
2,000 five-second clips across 50 environmental sound classes (animals, natural soundscapes, human sounds, mechanical). Human accuracy ~81.3%; the standard protocol is cross-validation over the dataset's five predefined folds. Convolutional audio classifiers surpassed human accuracy by 2019. Used today as a downstream probing task for audio representations, not as a frontier benchmark.
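A minimal sketch of the fold-averaged accuracy computation, assuming predictions and labels are already NumPy arrays and that fold ids come from the `fold` column of the dataset's esc50.csv metadata (the function name `esc50_cv_accuracy` is illustrative, not from any official toolkit):

```python
import numpy as np

def esc50_cv_accuracy(preds, labels, folds):
    """Mean accuracy over ESC-50's five predefined folds.

    preds, labels: int arrays of shape (2000,) with class ids 0..49,
    where preds[i] comes from a model trained on the four folds that
    do not contain clip i.
    folds: int array of shape (2000,) with fold ids 1..5, taken from
    the `fold` column of esc50.csv.
    """
    per_fold = [(preds[folds == k] == labels[folds == k]).mean()
                for k in np.unique(folds)]
    return float(np.mean(per_fold))
```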
AudioSet
2 million 10-second YouTube clips with 527 audio event labels in a hierarchical ontology. The ImageNet of audio — massive scale, used more as a pretraining resource than a pure eval benchmark. Strong models exceed 0.50 mAP on the eval set. HTS-AT and PaSST established Transformer-era baselines.
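For reference, the mAP metric is macro-averaged precision over the 527 labels. A hedged sketch using scikit-learn's `average_precision_score` (array names are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def audioset_map(scores, targets):
    """Macro mAP: average precision per class, averaged over classes.

    scores:  (n_clips, 527) float array of model scores.
    targets: (n_clips, 527) binary multi-label ground truth.
    Classes with no positive clip in the split are skipped to avoid
    an undefined average precision.
    """
    aps = []
    for c in range(targets.shape[1]):
        if targets[:, c].any():
            aps.append(average_precision_score(targets[:, c], scores[:, c]))
    return float(np.mean(aps))
```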
MUSDB18
150 full-length stereo music tracks with separated stems (vocals, drums, bass, other). The standard music source separation benchmark; signal-to-distortion ratio (SDR) is the metric. Open-Unmix and Demucs established strong baselines; Demucs v4 reached ~9.2 dB SDR on vocals. Still the reference for music separation.
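For orientation, the plain global SDR definition in dB. Note this is only an approximation of the leaderboard numbers, which use the BSSEval SDR from the `museval` package (computed on short frames and median-aggregated):

```python
import numpy as np

def sdr(reference, estimate, eps=1e-9):
    """Global signal-to-distortion ratio in dB:
    10 * log10(||s||^2 / ||s - s_hat||^2).

    reference, estimate: float arrays of matching shape, e.g.
    (channels, samples) for a stereo stem. eps guards against
    division by zero on silent segments.
    """
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10.0 * np.log10(num / den + eps)
```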
Clotho
6,974 audio clips, each with five crowdsourced captions. The standard audio captioning benchmark, scored with SPIDEr (the mean of CIDEr and SPICE) and FENSE. Established the audio-to-text captioning task as a distinct evaluation domain — different from classification, closer to image captioning.
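SPIDEr itself is just the average of two established captioning metrics. A trivial sketch, assuming corpus-level CIDEr and SPICE scores have already been computed (e.g. with the pycocoevalcap scorers run against Clotho's five references per clip):

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr = arithmetic mean of CIDEr (n-gram consensus, fluency)
    and SPICE (semantic scene-graph overlap)."""
    return 0.5 * (cider + spice)
```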
AudioCaps + Clotho Retrieval
Zero-shot audio↔text retrieval evaluated on the AudioCaps and Clotho test splits, reporting Recall@K (R@K) in both directions: text-to-audio and audio-to-text. The benchmark setup was formalised in the CLAP paper (CLAP is the reference model, not the benchmark itself); subsequent audio-text models all report on the same splits.
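A minimal R@K sketch over a precomputed similarity matrix, assuming the simplified one-caption-per-clip pairing where the correct match for query i is candidate i. Real harnesses track a set of correct indices per query, since both test sets have five captions per clip:

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K from a similarity matrix.

    sim: (n_queries, n_candidates) array where sim[i, j] is the score
    between query i and candidate j (e.g. cosine similarity of CLAP
    embeddings). For text-to-audio the queries are captions and the
    candidates are audio clips; pass sim.T for audio-to-text.
    """
    ranks = np.argsort(-sim, axis=1)  # candidate indices, best first
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return float(hits.mean())
```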
AudioBench
8 tasks, 26 datasets covering speech understanding, audio scene analysis, and music understanding. Designed to evaluate large audio-language models (LALMs) on instruction following over diverse audio inputs. Gemini 1.5 Pro and GPT-4o Audio are the top performers; open-source LALMs trail significantly.
VoiceBench
Evaluates spoken instruction-following capability across diverse accents, noise conditions, and speaking styles. Tests whether audio-language models follow spoken instructions as well as they follow typed ones — a critical deployment gap. Currently the reference benchmark for voice-interface robustness specifically.