Audio Classification Benchmark

The Sound Classification Challenge

AudioSet is the ImageNet of audio: 2M+ clips across 632 sound classes. Understanding how models learn to hear is the foundation of audio AI.

AudioSet Stats

  • Total clips: 2,084,320 (10 seconds each)
  • Current SOTA (mAP): 0.498
  • Sound classes: 632

The Multi-Label Classification Challenge

Unlike image classification where each image has one label, audio clips often contain multiple simultaneous sounds. A 10-second clip might have "Speech", "Music", and "Traffic" all at once.

Single-Label vs Multi-Label

ImageNet (Single-Label)

Each image has exactly one correct label: "cat" OR "dog" OR "car". Use softmax + cross-entropy.

AudioSet (Multi-Label)

Each clip can carry several labels at once: "Speech" AND "Music" AND "Laughter". Use sigmoid + binary cross-entropy.

Why mAP (mean Average Precision)?

For multi-label problems, we need metrics that handle varying thresholds and class imbalance:

Average Precision (AP)

For each class, rank all clips by confidence and compute the area under the precision-recall curve.

Mean AP (mAP)

Average AP across all 527 evaluation classes. Treats rare and common sounds equally.

mAUC (mean AUC-ROC)

Area under ROC curve, averaged across classes. Often reported alongside mAP.
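A minimal sketch of how these metrics are computed with scikit-learn, using random data in place of real model scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# y_true: binary label matrix (clips x classes); y_score: model confidences.
rng = np.random.default_rng(0)
n_clips, n_classes = 1000, 527
y_true = (rng.random((n_clips, n_classes)) < 0.05).astype(int)
y_score = rng.random((n_clips, n_classes))

# Keep only classes with at least one positive; AP is undefined otherwise.
valid = y_true.sum(axis=0) > 0

ap_per_class = average_precision_score(
    y_true[:, valid], y_score[:, valid], average=None)
mAP = ap_per_class.mean()  # macro average: rare classes count equally
mAUC = roc_auc_score(y_true[:, valid], y_score[:, valid], average="macro")
print(f"mAP={mAP:.3f}  mAUC={mAUC:.3f}")
```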

AudioSet Ontology

AudioSet organizes its 632 classes into a hierarchical ontology with seven top-level categories, including:

Human sounds (42 classes)
Speech, Laughter, Crying, Cough, Sneeze

Animal (75 classes)
Dog bark, Cat meow, Bird chirp, Rooster, Insect

Music (137 classes)
Singing, Piano, Guitar, Drums, Orchestra

Sounds of things (263 classes)
Engine, Door, Glass, Explosion, Gunshot

Natural sounds (31 classes)
Rain, Thunder, Wind, Water, Fire

Source-ambiguous (14 classes)
Silence, White noise, Static, Echo
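The ontology itself ships as a machine-readable JSON file in the audioset/ontology GitHub repository. A hedged sketch of walking it; the URL, the "id"/"name"/"child_ids" field names, and the /m/0dgw9r ID for "Human sounds" are assumptions based on the public release:

```python
import json
import urllib.request

# Assumed location of the published ontology file.
URL = "https://raw.githubusercontent.com/audioset/ontology/master/ontology.json"
with urllib.request.urlopen(URL) as f:
    nodes = {n["id"]: n for n in json.load(f)}

def print_subtree(node_id, depth=0, max_depth=2):
    """Recursively print a category and its children."""
    node = nodes[node_id]
    print("  " * depth + node["name"])
    if depth < max_depth:
        for child_id in node.get("child_ids", []):
            print_subtree(child_id, depth + 1, max_depth)

# Freebase-style ID assumed to be "Human sounds" in the ontology.
print_subtree("/m/0dgw9r", max_depth=1)
```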

Model Architectures

CNN-Based (PANNs)

Convolutional layers extract local time-frequency patterns; attention pooling aggregates them across time. Fast inference, production-ready.

Vision Transformer (AST)

Treats the spectrogram as an image: patch embeddings plus self-attention capture global dependencies. SOTA quality, strong transfer learning.

Audio Tokenizer (BEATs)

Learns discrete audio tokens via iterative self-supervised learning, with self-distillation from continuous to discrete representations. Current SOTA, efficient.
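To make the CNN recipe concrete, here is a minimal sketch of a log-mel frontend feeding a small CNN with attention pooling over time. The layer sizes and spectrogram parameters are illustrative, not the published PANNs configuration:

```python
import torch
import torch.nn as nn
import torchaudio

# Log-mel frontend (illustrative settings for 10 s clips at 32 kHz).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=32000, n_fft=1024, hop_length=320, n_mels=64)

class AttentionPoolingCNN(nn.Module):
    def __init__(self, n_classes=527):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.att = nn.Conv1d(128, n_classes, 1)  # per-frame attention logits
        self.cla = nn.Conv1d(128, n_classes, 1)  # per-frame class logits

    def forward(self, spec):                     # spec: (B, 1, mels, time)
        x = self.conv(spec).mean(dim=2)          # pool freq axis -> (B, 128, T)
        w = torch.softmax(self.att(x), dim=-1)   # attention weights over time
        return (w * torch.sigmoid(self.cla(x))).sum(dim=-1)  # (B, n_classes)

wave = torch.randn(2, 320000)                    # 2 clips, 10 s @ 32 kHz
spec = (mel(wave) + 1e-6).log().unsqueeze(1)     # (B, 1, 64, frames)
probs = AttentionPoolingCNN()(spec)              # clipwise probabilities
```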
Architecture Evolution Timeline

  • 2020: CNN Era (PANNs). Kong et al. establish strong CNN baselines with attention pooling. mAP: 0.431
  • 2021: Transformer Revolution (AST). Gong et al. apply ViT to audio, transferring from ImageNet. mAP: 0.485 (+0.054)
  • 2022: Efficient Transformers (HTS-AT). Chen et al. use the Swin architecture for better efficiency. mAP: 0.471
  • 2023: Self-Supervised (BEATs). Chen et al. iterate SSL with discrete tokenization. mAP: 0.498 (+0.013)

AudioSet Leaderboard

Detailed comparison on AudioSet eval set. All models use 10-second input clips with mel spectrogram features.

| Rank | Model | Organization | mAP | mAUC | Params | Pretraining | Year |
|------|-------|--------------|-----|------|--------|-------------|------|
| 1 | BEATs iter3+ | Microsoft | 0.498 | 0.975 | 90M | AudioSet-2M self-supervised | 2023 |
| 2 | AST | MIT/IBM | 0.485 | 0.972 | 87M | ImageNet-21k + AudioSet | 2021 |
| 3 | EfficientAT-M2 | TU Munich | 0.476 | 0.971 | 30M | ImageNet + AudioSet | 2023 |
| 4 | HTS-AT | ByteDance | 0.471 | 0.970 | 31M | AudioSet | 2022 |
| 5 | CLAP (HTSAT-base) | LAION/Microsoft | 0.463 | 0.968 | 86M | LAION-Audio-630K | 2023 |
| 6 | PANNs CNN14 | ByteDance | 0.431 | 0.963 | 81M | AudioSet from scratch | 2020 |

BEATs iter3+ AS2M, Microsoft (2023): mAP 0.498

Iterative audio pre-training with discrete tokenization. Audio Tokenizer + Transformer, 90M params.

AST (AudioSet + ImageNet), MIT/IBM (2021): mAP 0.485

First pure-attention model for audio, with no convolutions. Vision Transformer (ViT-B/16), 87M params.

EfficientAT-M2, TU Munich (2023): mAP 0.476

Best efficiency; real-time capable on edge devices. EfficientNet + mel-adapted frontend, 30M params.

HTS-AT, ByteDance (2022): mAP 0.471

Efficient hierarchical structure with a good compute/accuracy tradeoff. Swin Transformer + token-semantic module, 31M params.

ESC-50 Benchmark

Environmental Sound Classification

ESC-50 is a simpler, cleaner benchmark: 2,000 5-second clips across 50 environmental sound classes. Single-label classification with 5-fold cross-validation makes it easier to compare models directly.

  • Total clips: 2,000
  • Classes: 50
  • Clip duration: 5 seconds
  • Evaluation: 5-fold cross-validation
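A hedged sketch of running the official protocol, assuming the dataset repository's meta/esc50.csv metadata file with its "fold" column (1-5); each fold serves once as the held-out test set:

```python
import pandas as pd

# Path depends on where the ESC-50 repository is checked out.
meta = pd.read_csv("ESC-50-master/meta/esc50.csv")

for test_fold in range(1, 6):
    train = meta[meta["fold"] != test_fold]
    test = meta[meta["fold"] == test_fold]
    # ...train on `train`, evaluate on `test`, then report the
    # mean accuracy over the five folds...
    print(f"fold {test_fold}: {len(train)} train / {len(test)} test clips")
```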

ESC-50 Class Categories

  • Animals: Dog, Cat, Rooster, Pig, Cow, Frog, etc.
  • Natural: Rain, Sea waves, Crackling fire, Wind, etc.
  • Human (non-speech): Crying baby, Sneezing, Clapping, Laughing, etc.
  • Interior/Domestic: Door knock, Clock alarm, Keyboard, etc.
  • Exterior/Urban: Helicopter, Chainsaw, Siren, Car horn, Train, etc.
ESC-50 Leaderboard

| Rank | Model | Organization | Accuracy (%) | Params | Pretraining | Year |
|------|-------|--------------|--------------|--------|-------------|------|
| 1 | BEATs | Microsoft | 98.1 | 90M | AudioSet-2M | 2023 |
| 2 | SSAST | MIT/IBM | 96.8 | 89M | AudioSet + LibriSpeech | 2022 |
| 3 | CLAP | LAION | 96.7 | 86M | LAION-Audio-630K | 2023 |
| 4 | AST | MIT/IBM | 95.6 | 87M | ImageNet + AudioSet | 2021 |
| 5 | PANNs CNN14 | ByteDance | 94.7 | 81M | AudioSet | 2020 |
| 6 | Wav2Vec 2.0 + Linear | Meta | 92.3 | 317M | Libri-Light 60k hours | 2020 |

Known AudioSet Issues

  • Label Noise: ~30% of labels may be incorrect or incomplete
  • Missing Videos: ~20% of YouTube videos no longer available
  • Class Imbalance: "Speech" has millions; "Yodeling" has hundreds
  • Weak Labels: Labels apply to full 10s clip, not specific moments
  • Copyright Issues: Some videos have been removed for copyright reasons

Best Practices

  • Pretraining Matters: ImageNet weights transfer surprisingly well
  • Augmentation: SpecAugment, mixup, and time-stretching are essential (sketched below)
  • Balanced Sampling: Oversample rare classes during training
  • Multi-Scale: Use multiple spectrogram resolutions
  • Ensemble: Top results often combine multiple architectures
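A sketch of two augmentation staples from the list above: SpecAugment-style masking via torchaudio's built-in transforms, plus mixup on a batch of spectrograms with multi-label targets. All parameters are illustrative:

```python
import torch
import torchaudio

# SpecAugment-style masks: zero out random frequency and time bands.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=16)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=64)

def augment(spec_batch, targets, mixup_alpha=0.4):
    spec_batch = time_mask(freq_mask(spec_batch))
    # Mixup: blend each clip with a randomly paired one; for multi-label
    # BCE training the 0/1 target vectors are blended the same way.
    lam = torch.distributions.Beta(mixup_alpha, mixup_alpha).sample()
    perm = torch.randperm(spec_batch.size(0))
    mixed_spec = lam * spec_batch + (1 - lam) * spec_batch[perm]
    mixed_targets = lam * targets + (1 - lam) * targets[perm]
    return mixed_spec, mixed_targets

specs = torch.randn(8, 1, 64, 1001)            # (batch, 1, mels, frames)
labels = torch.randint(0, 2, (8, 527)).float() # multi-label 0/1 targets
aug_specs, aug_labels = augment(specs, labels)
```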

Related Datasets

AudioSet (2017)

2M+ human-labeled 10-second YouTube video clips covering 632 audio event classes.

  • Samples: 2,084,320
  • Primary metric: mAP

ESC-50 (2015)

2,000 environmental audio recordings organized into 50 classes (animals, natural soundscapes, etc.).

  • Samples: 2,000
  • Primary metric: Accuracy

Contribute to Audio Classification

Have you achieved better results on AudioSet or ESC-50? Benchmarked a new architecture? Help the community by sharing your verified results.