Audio Understanding Benchmarks
How audio AI evaluation evolved from environmental sound classification on small datasets through large-scale event detection to foundation-model-era benchmarks that combine audio perception with language understanding. The lineage runs from ESC-50 (2015) through AudioSet (2017) to audio-text retrieval and captioning benchmarks (Clotho, AudioCaps — popularised by the CLAP model), then to VoiceBench and AudioBench, which test audio-language model instruction following. Branches include MUSDB18 (music source separation) and MusicNet (note-level music transcription).
ESC-50 saturated in the late 2010s and serves as a sanity check today, not a frontier benchmark. AudioSet remains the backbone training resource for audio classification, and strong models now exceed 0.50 mAP on its eval set. The inflection point was the CLAP *model* (Contrastive Language-Audio Pretraining, 2022): its release formalised zero-shot audio-text retrieval as the standard evaluation on AudioCaps and Clotho, and the interesting question shifted from 'classify this sound' to 'describe this audio' and 'follow instructions about this audio'. VoiceBench and AudioBench (2024) represent the current frontier: instruction-following evaluation for audio-language models, where the task is as open-ended as text LLM evaluation but grounded in audio input. No single benchmark has consolidated community attention the way LibriSpeech did for ASR.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Click any node to jump to its detail.
Nodes in detail.
ESC-50
2,000 five-second clips across 50 environmental sound classes (animals, natural soundscapes, human sounds, mechanical). Human accuracy ~81.3%; the standard protocol is cross-validation over the dataset's five predefined folds. Convolutional audio classifiers surpassed human accuracy by 2019. Used today as a downstream probing task for audio representations, not as a frontier benchmark.
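A minimal sketch of the fold-averaged accuracy computation, assuming predictions and labels are already NumPy arrays and that fold ids come from the `fold` column of the dataset's esc50.csv metadata (the function name `esc50_cv_accuracy` is illustrative, not from any official toolkit):

```python
import numpy as np

def esc50_cv_accuracy(preds, labels, folds):
    """Mean accuracy over ESC-50's five predefined folds.

    preds, labels: int arrays of shape (2000,) with class ids 0..49,
    where preds[i] comes from a model trained on the four folds that
    do not contain clip i.
    folds: int array of shape (2000,) with fold ids 1..5, taken from
    the `fold` column of esc50.csv.
    """
    per_fold = [(preds[folds == k] == labels[folds == k]).mean()
                for k in np.unique(folds)]
    return float(np.mean(per_fold))
```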
AudioSet
2 million 10-second YouTube clips with 527 audio event labels in a hierarchical ontology. The ImageNet of audio — massive scale, used more as a pretraining resource than a pure eval benchmark. Strong models exceed 0.50 mAP on the eval set. HTS-AT and PaSST established Transformer-era baselines.
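For reference, the mAP metric is macro-averaged precision over the 527 labels. A hedged sketch using scikit-learn's `average_precision_score` (array names are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def audioset_map(scores, targets):
    """Macro mAP: average precision per class, averaged over classes.

    scores:  (n_clips, 527) float array of model scores.
    targets: (n_clips, 527) binary multi-label ground truth.
    Classes with no positive clip in the split are skipped to avoid
    an undefined average precision.
    """
    aps = []
    for c in range(targets.shape[1]):
        if targets[:, c].any():
            aps.append(average_precision_score(targets[:, c], scores[:, c]))
    return float(np.mean(aps))
```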
MUSDB18
150 full-length stereo music tracks with separated stems (vocals, drums, bass, other). The standard music source separation benchmark; signal-to-distortion ratio (SDR) is the metric. Open-Unmix and Demucs established strong baselines; Demucs v4 reached ~9.2 dB SDR on vocals. Still the reference for music separation.
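For orientation, the plain global SDR definition in dB. Note this is only an approximation of the leaderboard numbers, which use the BSSEval SDR from the `museval` package (computed on short frames and median-aggregated):

```python
import numpy as np

def sdr(reference, estimate, eps=1e-9):
    """Global signal-to-distortion ratio in dB:
    10 * log10(||s||^2 / ||s - s_hat||^2).

    reference, estimate: float arrays of matching shape, e.g.
    (channels, samples) for a stereo stem. eps guards against
    division by zero on silent segments.
    """
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10.0 * np.log10(num / den + eps)
```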
Clotho
6,974 audio clips, each with five crowdsourced captions. The standard audio captioning benchmark, scored with SPIDEr (the mean of CIDEr and SPICE) and FENSE. Established the audio-to-text captioning task as a distinct evaluation domain — different from classification, closer to image captioning.
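SPIDEr itself is just the average of two established captioning metrics. A trivial sketch, assuming corpus-level CIDEr and SPICE scores have already been computed (e.g. with the pycocoevalcap scorers run against Clotho's five references per clip):

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr = arithmetic mean of CIDEr (n-gram consensus, fluency)
    and SPICE (semantic scene-graph overlap)."""
    return 0.5 * (cider + spice)
```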
AudioCaps + Clotho Retrieval
Zero-shot audio↔text retrieval evaluated on the AudioCaps and Clotho test splits, reporting Recall@K (R@K) in both directions: text-to-audio and audio-to-text. The benchmark setup was formalised in the CLAP paper (CLAP is the reference model, not the benchmark itself); subsequent audio-text models all report on the same splits.
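A minimal R@K sketch over a precomputed similarity matrix, assuming the simplified one-caption-per-clip pairing where the correct match for query i is candidate i. Real harnesses track a set of correct indices per query, since both test sets have five captions per clip:

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K from a similarity matrix.

    sim: (n_queries, n_candidates) array where sim[i, j] is the score
    between query i and candidate j (e.g. cosine similarity of CLAP
    embeddings). For text-to-audio the queries are captions and the
    candidates are audio clips; pass sim.T for audio-to-text.
    """
    ranks = np.argsort(-sim, axis=1)  # candidate indices, best first
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return float(hits.mean())
```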
AudioBench
8 tasks, 26 datasets covering speech understanding, audio scene analysis, and music understanding. Designed to evaluate large audio-language models (LALMs) on instruction following over diverse audio inputs. Gemini 1.5 Pro and GPT-4o Audio are the top performers; open-source LALMs trail significantly.
VoiceBench
Evaluates spoken instruction-following capability across diverse accents, noise conditions, and speaking styles. Tests whether audio-language models follow spoken instructions as well as they follow typed ones — a critical deployment gap. Currently the reference benchmark for voice-interface robustness specifically.