Audio Classification Benchmark

The Sound Classification Challenge

AudioSet is the ImageNet of audio: 2M+ clips across 632 sound classes. Understanding how models learn to hear is the foundation of audio AI.

AudioSet Stats

2,084,320
Total Clips (10s each)
0.502
Current SOTA (mAP)
632
Sound Classes

How Audio Classification Works

Pipeline diagram: raw audio (10 s @ 16 kHz) → STFT → mel spectrogram (128 mel bins) → 16x16 patches + positional encoding → 12-layer Transformer encoder (self-attention + FFN) → multi-label predictions over 527 classes (e.g. Dog bark 92%, Wind 78%, Speech 31%, Music 8%).
01

Mel Spectrogram

Raw audio is converted into a 2D time-frequency image using the Short-Time Fourier Transform (STFT), then mapped to the mel scale which approximates human pitch perception. The result is a 128-bin spectrogram where the x-axis is time and y-axis is frequency.
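A quick shape check makes this concrete: the number of time frames follows directly from the hop length. The 10 ms hop below is an illustrative choice; exact values vary by model.

```python
# Frames produced by a 10 s clip (hop length is an illustrative assumption)
sr = 16000        # sample rate in Hz
duration = 10.0   # clip length in seconds
hop = 160         # hop length in samples: 10 ms at 16 kHz
n_frames = int(sr * duration) // hop + 1
print(n_frames)   # 1001 time frames, each with 128 mel bins
```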

02

Patch Embedding

The spectrogram is split into non-overlapping 16x16 patches (like Vision Transformers do with images). Each patch is linearly projected into an embedding vector and combined with positional encodings that preserve spatial and temporal order.
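A minimal PyTorch sketch of this patchify-and-embed step, assuming an illustrative 128 x 1024 log-mel input and ViT-Base's 768-dim embeddings (both are assumptions, not fixed by any particular model):

```python
import torch

# Hypothetical input: batch of 1, 128 mel bins, 1024 time frames
spec = torch.randn(1, 128, 1024)

patch = 16
# Cut into non-overlapping 16x16 patches (as ViT/AST do with images)
patches = spec.unfold(1, patch, patch).unfold(2, patch, patch)  # (1, 8, 64, 16, 16)
patches = patches.reshape(1, -1, patch * patch)                 # (1, 512, 256)

# Linear projection to the embedding dim, plus learned positional encodings
embed_dim = 768
proj = torch.nn.Linear(patch * patch, embed_dim)
pos = torch.nn.Parameter(torch.zeros(1, patches.shape[1], embed_dim))
tokens = proj(patches) + pos   # (1, 512, 768): 512 patch tokens
```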

03

Multi-Label Prediction

Unlike image classification (one label per image), audio clips often contain overlapping sounds. Models use sigmoid outputs per class instead of softmax, and evaluation uses mAP (mean Average Precision) rather than top-1 accuracy.
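The sigmoid-vs-softmax point in miniature, with invented logits for five classes:

```python
import torch

# Invented logits for 5 of the 527 classes on one clip
labels = ["Dog bark", "Wind", "Speech", "Music", "Silence"]
logits = torch.tensor([[2.5, 1.2, -0.8, -2.0, -3.0]])

# Multi-label: one independent sigmoid per class, so several classes
# can score high at once (a clip can contain dog bark AND wind)
probs = torch.sigmoid(logits)

# Softmax would force scores to compete and sum to 1, which is wrong
# when sounds genuinely overlap
competing = torch.softmax(logits, dim=-1)

for name, p in zip(labels, probs[0]):
    print(f"{name}: {p:.2f}")
```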

AudioSet Leaderboard

Detailed comparison on AudioSet eval set. All models use 10-second input clips with mel spectrogram features.

Rank | Model | Institution | mAP | mAUC | Params | Pretraining | Year
#1 | SSLAM | University of Surrey / Univ. of Edinburgh | 0.502 | 0.977 | 88M | AudioSet-2M self-supervised | 2025
#2 | EAT | Chinese Academy of Sciences | 0.486 | 0.973 | 88M | AudioSet-2M self-supervised | 2024
#3 | A-JEPA (ViT-B) | Zhejiang University / Huawei | 0.486 | 0.973 | 86M | AudioSet-2M self-supervised | 2023
#4 | BAT | University of Surrey | 0.485 | 0.973 | 91M | AudioSet-2M self-supervised | 2026
#5 | AST | MIT/IBM | 0.485 | 0.972 | 87M | ImageNet-21k + AudioSet | 2021
#6 | BEATs | Microsoft | 0.480 | 0.975 | 90M | AudioSet-2M self-supervised | 2023
#7 | EfficientAT-M2 | | 0.476 | 0.971 | 30M | ImageNet + AudioSet | 2023
#8 | HTS-AT | ByteDance | 0.471 | 0.970 | 31M | AudioSet | 2022
#9 | CLAP (HTSAT-base) | LAION/Microsoft | 0.463 | 0.968 | 86M | LAION-Audio-630K | 2023
#10 | PANNs CNN14 | ByteDance | 0.431 | 0.963 | 81M | AudioSet from scratch | 2020

SSLAM

University of Surrey / Univ. of Edinburgh (2025)
0.502

ICLR 2025. Trains on audio mixtures to improve polyphonic robustness; +3.9% over prior SOTA on AS-2M

ViT-Base + CNN decoder (SSL), 88M params

EAT

Chinese Academy of Sciences (2024)
0.486

IJCAI 2024. Utterance-Frame Objective enables ~15x faster pretraining than MAE-style models

ViT-Base bootstrap SSL (UFO objective), 88M params

A-JEPA (ViT-B)

Zhejiang University / Huawei (2023)
0.486

Adapts I-JEPA to audio; predicts latent representations rather than reconstructing raw spectrograms

ViT-Base JEPA (joint-embedding predictive), 86M params

BAT

University of Surrey (2026)
0.485

Feb 2026. Modernizes Data2Vec 2.0 with gated-attention SSL targets; sets a new SOTA for frozen-feature probing without fine-tuning.

ViT-Base + Convex Gated Probing (CGP), 91M params

Key Metrics Explained

mAP: mean Average Precision

The primary AudioSet metric. Computes Average Precision for each of 527 classes, then averages. Handles multi-label: a clip with "dog bark" + "wind" gets scored on both predictions independently.

Range: 0.0 – 1.0 | Higher = better
SOTA: 0.502 (SSLAM, 2025)
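A from-scratch toy computation of mAP; the class count, labels, and scores below are all invented for illustration:

```python
import numpy as np

def average_precision(y_true, scores):
    # AP: scan predictions from highest to lowest score and average
    # the precision observed at each true-positive hit
    order = np.argsort(-scores)
    hits = y_true[order]
    cum_tp = np.cumsum(hits)
    return (cum_tp[hits == 1] / (np.nonzero(hits)[0] + 1)).mean()

# Toy eval set: 4 clips x 3 classes, multi-label ground truth
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
scores = np.array([[0.9, 0.2, 0.8],
                   [0.8, 0.7, 0.3],
                   [0.7, 0.6, 0.2],
                   [0.3, 0.1, 0.9]])

# mAP: per-class AP (one column per class), then the unweighted mean
aps = [average_precision(y_true[:, c], scores[:, c]) for c in range(3)]
mAP = float(np.mean(aps))
```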
mAUC: mean Area Under ROC Curve

Measures how well the model separates positive from negative examples across all operating thresholds. Less sensitive to class imbalance than mAP. Most models score 0.96–0.98, making it less discriminating at the frontier.

Range: 0.0 – 1.0 | Higher = better
SOTA: 0.977 (SSLAM, 2025)
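AUC has a handy rank-statistic reading: the probability that a randomly chosen positive outscores a randomly chosen negative. A toy single-class version (labels and scores invented); mAUC averages this over all classes:

```python
import numpy as np

# One class, five clips: labels and scores are invented
y = np.array([1, 0, 1, 0, 0])
s = np.array([0.9, 0.8, 0.6, 0.3, 0.2])

pos, neg = s[y == 1], s[y == 0]
# Fraction of (positive, negative) pairs ranked correctly
auc = (pos[:, None] > neg[None, :]).mean()
print(auc)   # 5 of 6 pairs ordered correctly
```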
Acc: Top-1 Accuracy (ESC-50)

Standard single-label accuracy used for ESC-50 where each clip has exactly one correct class. Unlike mAP, this uses argmax (one prediction per clip). ESC-50 is approaching saturation at 99.1%.

Range: 0 – 100% | Higher = better
SOTA: 99.1% (OmniVec2, 2024)
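For contrast with mAP, top-1 accuracy in miniature (logits and targets invented):

```python
import numpy as np

# Three clips, three classes: single-label, so argmax picks one class
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3],
                   [0.9, 0.1, 0.4]])
targets = np.array([0, 1, 2])

preds = logits.argmax(axis=1)      # one prediction per clip
acc = (preds == targets).mean()    # 2 of 3 correct
```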

Architecture Evolution

Audio classification architectures have evolved through three major phases. CNNs dominated until 2020 with models like PANNs. Vision Transformers took over in 2021 when AST showed that spectrograms could be treated as images. Since 2023, self-supervised methods (BEATs, EAT, SSLAM) have pushed SOTA by pre-training on unlabeled audio.

Architecture evolution diagram: CNN-based PANNs CNN14 (2020, mAP 0.431): mel spec → 4 conv blocks → attention pooling → linear head. Pure Transformer AST (2021, mAP 0.485): mel spec → patch + pos enc → 12x Transformer → CLS token → linear head. Self-supervised SSLAM (2025, mAP 0.502): mel spec → patch + mask → 12x ViT encoder → CNN decoder → fine-tune head. CNN (2020) → Transformer (2021) → Self-Supervised (2023-2025).

CNN Era (2017–2020)

PANNs (2020) established the CNN baseline with stacked convolutional blocks and attention pooling. Trained directly on AudioSet from scratch. Still widely deployed in production for its simplicity and inference speed. Peaked at 0.431 mAP.

Transformer Era (2021–2022)

AST (2021) proved pure self-attention works for audio by treating spectrograms as patch sequences. ImageNet pretraining transferred surprisingly well. HTS-AT added hierarchical structure from Swin Transformer. Jump to 0.485 mAP.

SSL Era (2023–present)

Self-supervised learning on 2M+ unlabeled AudioSet clips. BEATs uses discrete tokenization, EAT bootstraps with utterance-frame objectives, and SSLAM trains on audio mixtures for polyphonic robustness. Current SOTA: 0.502 mAP.
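The masked-prediction recipe behind these models can be caricatured in a few lines. This MAE-style sketch is purely illustrative: the shapes are arbitrary, and BEATs, EAT, and SSLAM all use richer targets and decoders than shown here.

```python
import torch

# Patch tokens standing in for an embedded spectrogram (shapes arbitrary)
tokens = torch.randn(1, 512, 768)

# Mask 75% of patches; encode only the visible ones
mask_ratio = 0.75
num_mask = int(tokens.shape[1] * mask_ratio)
perm = torch.randperm(tokens.shape[1])
masked_idx, visible_idx = perm[:num_mask], perm[num_mask:]

encoder = torch.nn.TransformerEncoderLayer(768, nhead=8, batch_first=True)
decoder = torch.nn.Linear(768, 768)   # stand-in for a real decoder

visible = encoder(tokens[:, visible_idx])          # (1, 128, 768)

# Predict (a pooled summary of) the masked content from the visible context
pred = decoder(visible.mean(dim=1, keepdim=True))
target = tokens[:, masked_idx].mean(dim=1, keepdim=True)
loss = torch.nn.functional.mse_loss(pred, target)  # SSL training signal
```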

ESC-50 Benchmark

ESC-50 contains 2,000 five-second recordings across 50 environmental sound classes (rain, dog barking, clock ticking, etc.). Unlike AudioSet, it is single-label and measured by top-1 accuracy. With OmniVec2 at 99.1%, this benchmark is approaching saturation.

Rank | Model | Institution | Accuracy (%) | Params | Pretraining | Year
#1 | OmniVec2 | TCS Research | 99.1 | 307M | Multimodal (12 modalities) | 2024
#2 | MaskSpec | Beijing Academy of AI | 98.2 | 86M | AudioSet self-supervised | 2022
#3 | BEATs | Microsoft | 98.1 | 90M | AudioSet-2M | 2023
#4 | SSAST | MIT/IBM | 96.8 | 89M | AudioSet + LibriSpeech | 2022
#5 | CLAP | LAION | 96.7 | 86M | LAION-Audio-630K | 2023
#6 | SSLAM | University of Surrey / Univ. of Edinburgh | 96.2 | 88M | AudioSet-2M self-supervised | 2025
#7 | EAT | Chinese Academy of Sciences | 95.9 | 88M | AudioSet-2M self-supervised | 2024
#8 | AST | MIT/IBM | 95.6 | 87M | ImageNet + AudioSet | 2021
#9 | BAT | University of Surrey | 95.5 | 91M | AudioSet-2M self-supervised | 2026
#10 | PANNs CNN14 | ByteDance | 94.7 | 81M | AudioSet | 2020
#11 | | | 92.3 | 317M | LibriLight 60k hours | 2020

AudioSet Sound Classes

AudioSet's 632 classes are organized in a hierarchical ontology. Here are the top-level categories with example classes and their frequency in the dataset.

🗣 Human sounds (850K+): Speech, Singing, Laughter, Crying, Cough, Breathing

🎵 Music (620K+): Guitar, Piano, Drums, Violin, Bass, Synthesizer

🐕 Animals (180K+): Dog bark, Cat meow, Bird song, Insect buzz, Rooster, Frog

🌧 Environment (240K+): Rain, Thunder, Wind, Fire, Water, Waves

🚗 Vehicles (310K+): Car engine, Motorcycle, Train, Aircraft, Siren, Horn

🏠 Domestic (190K+): Door knock, Alarm, Microwave, Vacuum, Typing, Frying

💥 Impacts (95K+): Explosion, Glass break, Slam, Crash, Thud, Crack

Mechanical (120K+): Engine, Drill, Saw, Pump, Grinding, Ratchet

Quick Start Code

Run audio classification inference in a few lines. Here are copy-paste examples for the top models.

AST (Audio Spectrogram Transformer) | Python | 0.485 mAP
from transformers import AutoFeatureExtractor
from transformers import ASTForAudioClassification
import torchaudio

# Load model and feature extractor
model = ASTForAudioClassification.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593"
)
extractor = AutoFeatureExtractor.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593"
)

# Load audio (resample to 16kHz)
waveform, sr = torchaudio.load("audio.wav")
if sr != 16000:
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)

# Inference
inputs = extractor(waveform.squeeze(), sampling_rate=16000,
                   return_tensors="pt")
outputs = model(**inputs)
probs = outputs.logits.sigmoid()  # multi-label

# Top 5 predictions
top5 = probs[0].topk(5)
for score, idx in zip(top5.values, top5.indices):
    label = model.config.id2label[idx.item()]
    print(f"{label}: {score:.3f}")
PANNs (Production Baseline) | Python | 0.431 mAP
import panns_inference
from panns_inference import AudioTagging
import librosa

# Load pretrained PANNs CNN14
model = AudioTagging(
    checkpoint_path=None,  # auto-downloads
    device='cuda'
)

# Load and preprocess audio
audio, sr = librosa.load("audio.wav", sr=32000, mono=True)

# Run inference
clipwise_output, embedding = model.inference(audio[None, :])

# clipwise_output shape: (1, 527) — sigmoid probs
# embedding shape: (1, 2048) — can use for downstream

# Top predictions
import numpy as np
labels = panns_inference.labels
top_k = np.argsort(clipwise_output[0])[::-1][:5]
for idx in top_k:
    print(f"{labels[idx]}: {clipwise_output[0][idx]:.3f}")
BEATs (Self-Supervised) | Python | 0.480 mAP
import torch
import torchaudio
from BEATs import BEATs, BEATsConfig

# Load a BEATs checkpoint fine-tuned on AudioSet (the plain SSL
# checkpoint returns frame features, not class probabilities; the
# filename below is Microsoft's released fine-tuned checkpoint)
checkpoint = torch.load("BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt")
cfg = BEATsConfig(checkpoint['cfg'])
model = BEATs(cfg)
model.load_state_dict(checkpoint['model'])
model.eval()

# Load audio
waveform, sr = torchaudio.load("audio.wav")
if sr != 16000:
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)

# Classify (fine-tuned checkpoints return clip-level class probabilities)
with torch.no_grad():
    padding_mask = torch.zeros(waveform.shape).bool()
    probs, _ = model.extract_features(
        waveform, padding_mask=padding_mask
    )  # shape (1, 527), already sigmoid-activated

# Top-5 classes
top5 = probs[0].topk(5)
Generate Mel Spectrogram | Python | preprocessing
import torch
import torchaudio
import matplotlib.pyplot as plt

# Load audio
waveform, sr = torchaudio.load("audio.wav")

# Create mel spectrogram transform
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr,
    n_fft=1024,         # FFT window size
    hop_length=320,      # 20ms hop @ 16kHz
    n_mels=128,          # mel frequency bins
    f_min=50,            # min frequency
    f_max=8000,          # max frequency
)

# Convert to log-mel (decibels)
mel_spec = mel_transform(waveform)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel_spec)

# Visualize
plt.figure(figsize=(10, 4))
plt.imshow(log_mel[0], aspect='auto', origin='lower',
           cmap='magma')
plt.colorbar(label='dB')
plt.xlabel('Time frames')
plt.ylabel('Mel bins')
plt.title('Log-Mel Spectrogram')
plt.tight_layout()
plt.savefig('mel_spectrogram.png', dpi=150)

AudioSet SOTA Progress

Mean Average Precision on AudioSet has improved from 0.431 to 0.502 over five years — a 16% relative improvement driven primarily by self-supervised pretraining.
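The arithmetic behind that figure:

```python
# Relative improvement from PANNs (2020) to SSLAM (2025) on AudioSet mAP
panns, sslam = 0.431, 0.502
rel = (sslam - panns) / panns
print(f"{rel:.1%}")   # 16.5%
```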

SOTA progress chart (2020-2025, mAP axis 0.42-0.50): PANNs 0.431 (2020) → AST 0.485 (2021) → HTS-AT 0.471 (2022) → BEATs 0.480 (2023) → A-JEPA 0.486 (2023) → EAT 0.486 (2024) → SSLAM 0.502 (2025), spanning the CNN, Transformer, and self-supervised eras.

Frequently Asked Questions

What is AudioSet?
AudioSet is Google's large-scale audio dataset containing over 2 million 10-second clips labeled across 632 sound event classes. It is the de facto benchmark for audio classification, similar to ImageNet for computer vision. The ontology covers everything from music instruments to animal sounds, speech, and environmental noise.
What is mAP in audio classification?
mAP (mean Average Precision) is the primary metric for AudioSet evaluation. Because AudioSet is multi-label (a clip can contain multiple overlapping sounds), mAP computes the average precision for each class and then averages across all 527 evaluation classes. It ranges from 0 to 1, with higher being better. The current SOTA is 0.502 by SSLAM (2025).
What is the best audio classification model in 2026?
SSLAM from the University of Surrey and University of Edinburgh leads AudioSet with 0.502 mAP (ICLR 2025). It uses a ViT-Base encoder with self-supervised learning on audio mixtures. For ESC-50 environmental sound classification, OmniVec2 leads with 99.1% accuracy. For edge/real-time deployment, EfficientAT-M2 (0.476 mAP, 30M params) offers the best efficiency.
How do mel spectrograms work for audio AI?
A mel spectrogram converts raw audio waveforms into a 2D time-frequency image. The audio is split into short overlapping windows (typically 25ms), a Fourier transform computes the frequency content of each window, and the frequencies are mapped to the mel scale (which approximates human pitch perception). The result is a 2D image where the x-axis is time, the y-axis is mel-frequency, and pixel intensity represents energy. This allows audio models to use the same architectures as image classifiers.
What is the difference between AudioSet and ESC-50?
AudioSet is a large-scale multi-label benchmark (2M clips, 632 classes, measured by mAP) focusing on general audio event detection including music, speech, and environmental sounds. ESC-50 is a smaller single-label benchmark (2,000 clips, 50 classes, measured by accuracy) focused on environmental sounds like rain, dog barking, and clock ticking. AudioSet tests multi-label detection in noisy real-world audio; ESC-50 tests clean environmental sound recognition.
Why do audio models use Vision Transformers?
Audio models use Vision Transformers (ViT) because mel spectrograms are 2D images. The Audio Spectrogram Transformer (AST, 2021) showed that splitting a spectrogram into 16x16 patches and processing them with ViT-B/16 achieves state-of-the-art results without any convolutional layers. This opened the door to leveraging ImageNet-pretrained weights and self-supervised techniques from computer vision, driving rapid progress in audio classification.

Contribute to Audio Classification

Have you achieved better results on AudioSet or ESC-50? Benchmarked a new architecture? Help the community by sharing your verified results.