The Sound Classification Challenge
AudioSet is the ImageNet of audio: 2M+ clips across 632 sound classes. Understanding how models learn to hear is the foundation of audio AI.
How Audio Classification Works
Mel Spectrogram
Raw audio is converted into a 2D time-frequency image using the Short-Time Fourier Transform (STFT), then mapped to the mel scale which approximates human pitch perception. The result is a 128-bin spectrogram where the x-axis is time and y-axis is frequency.
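As a concrete anchor for the mel mapping, the widely used HTK-style conversion formula can be sketched in a few lines (a minimal illustration, not tied to any particular model):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping, used to place the edges of the mel filter banks."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps in mel space correspond to progressively wider steps in Hz,
# mirroring how human pitch resolution coarsens at high frequencies.
print(hz_to_mel(1000.0))             # close to 1000 by construction
print(mel_to_hz(hz_to_mel(4000.0)))  # round-trips to 4000 Hz
```

Libraries differ in the exact constants (Slaney vs. HTK variants), so treat this as the general shape rather than the formula every toolkit uses.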
Patch Embedding
The spectrogram is split into non-overlapping 16x16 patches (like Vision Transformers do with images). Each patch is linearly projected into an embedding vector and combined with positional encodings that preserve spatial and temporal order.
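The patchify step described above can be sketched with a strided convolution, the standard implementation trick; the 768-dim embedding size and 1024-frame input here are illustrative assumptions, not any specific model's configuration:

```python
import torch
import torch.nn as nn

# Toy log-mel spectrogram: batch 1, 1 channel, 128 mel bins x 1024 time frames
spec = torch.randn(1, 1, 128, 1024)

# A 16x16 convolution with stride 16 extracts non-overlapping patches and
# linearly projects each one in a single operation.
patch_embed = nn.Conv2d(in_channels=1, out_channels=768,
                        kernel_size=16, stride=16)

tokens = patch_embed(spec)                  # (1, 768, 8, 64)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 512, 768): 8*64 = 512 patches

# Learnable positional embeddings restore the time-frequency order
# that flattening discards.
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1], 768))
tokens = tokens + pos_embed
print(tokens.shape)  # torch.Size([1, 512, 768])
```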
Multi-Label Prediction
Unlike image classification (one label per image), audio clips often contain overlapping sounds. Models use sigmoid outputs per class instead of softmax, and evaluation uses mAP (mean Average Precision) rather than top-1 accuracy.
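A quick sketch of why sigmoid (not softmax) fits overlapping sounds, with made-up logits for three hypothetical classes:

```python
import torch

# Hypothetical logits for one clip over ["dog bark", "wind", "siren"]
logits = torch.tensor([[2.1, 1.8, -3.0]])

# Softmax forces the classes to compete: probabilities sum to 1,
# so two genuinely present sounds suppress each other.
softmax_probs = logits.softmax(dim=-1)

# Sigmoid scores each class independently, so co-occurring sounds
# can both cross the decision threshold.
sigmoid_probs = logits.sigmoid()  # roughly [0.89, 0.86, 0.05]

threshold = 0.5
predicted = (sigmoid_probs > threshold).squeeze(0)
print(predicted.tolist())  # [True, True, False]: two labels active
```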
AudioSet Leaderboard
Detailed comparison on AudioSet eval set. All models use 10-second input clips with mel spectrogram features.
| Rank | Model | mAP | mAUC | Params | Pretraining | Year |
|---|---|---|---|---|---|---|
| #1 | SSLAM University of Surrey / Univ. of Edinburgh | 0.502 | 0.977 | 88M | AudioSet-2M self-supervised | 2025 |
| #2 | EAT Chinese Academy of Sciences | 0.486 | 0.973 | 88M | AudioSet-2M self-supervised | 2024 |
| #3 | A-JEPA (ViT-B) Zhejiang University / Huawei | 0.486 | 0.973 | 86M | AudioSet-2M self-supervised | 2023 |
| #4 | BAT University of Surrey | 0.485 | 0.973 | 91M | AudioSet-2M self-supervised | 2026 |
| #5 | AST (AudioSet + ImageNet) MIT/IBM | 0.485 | 0.972 | 87M | ImageNet-21k + AudioSet | 2021 |
| #6 | BEATs iter3 AS2M Microsoft | 0.480 | 0.975 | 90M | AudioSet-2M self-supervised | 2023 |
| #7 | EfficientAT-M2 TU Munich | 0.476 | 0.971 | 30M | ImageNet + AudioSet | 2023 |
| #8 | HTS-AT ByteDance | 0.471 | 0.970 | 31M | AudioSet | 2022 |
| #9 | CLAP (HTSAT-base) LAION/Microsoft | 0.463 | 0.968 | 86M | LAION-Audio-630K | 2023 |
| #10 | PANNs CNN14 ByteDance | 0.431 | 0.963 | 81M | AudioSet from scratch | 2020 |
SSLAM
ICLR 2025. Trains on audio mixtures to improve polyphonic robustness; +3.9% over prior SOTA on AS-2M
EAT
IJCAI 2024. Utterance-Frame Objective enables ~15x faster pretraining than MAE-style models
A-JEPA (ViT-B)
Adapts I-JEPA to audio; predicts latent representations rather than reconstructing raw spectrograms
BAT
Feb 2026. Modernizes Data2Vec 2.0 with gated attention SSL targets; new SOTA frozen-feature probing without fine-tuning
Key Metrics Explained
mAP (mean Average Precision)
The primary AudioSet metric. Computes Average Precision for each of the 527 evaluated classes, then averages across classes. Handles multi-label scoring: a clip with "dog bark" + "wind" is scored on both predictions independently.
SOTA: 0.502 (SSLAM, 2025)
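For concreteness, mAP on a toy multi-label problem (3 classes, 4 clips, fabricated scores) using scikit-learn:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy ground truth and model scores: 4 clips x 3 classes
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.7],
                    [0.1, 0.8, 0.3],
                    [0.8, 0.6, 0.8],
                    [0.3, 0.1, 0.2]])

# Average Precision per class, then macro-average across classes: mAP.
mAP = average_precision_score(y_true, y_score, average="macro")
print(round(mAP, 3))  # 0.833: classes 0 and 1 are ranked perfectly, class 2 is not
```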
mAUC (mean Area Under the ROC Curve)
Measures how well the model separates positive from negative examples across all operating thresholds. Less sensitive to class imbalance than mAP. Most models score 0.96–0.98, making it less discriminating at the frontier.
SOTA: 0.977 (SSLAM, 2025)
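The same kind of toy setup works for mAUC; per-class ROC AUC is macro-averaged (numbers fabricated for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth and model scores: 4 clips x 2 classes
y_true = np.array([[1, 0],
                   [0, 1],
                   [1, 1],
                   [0, 0]])
y_score = np.array([[0.9, 0.3],
                    [0.2, 0.7],
                    [0.3, 0.8],
                    [0.4, 0.1]])

# ROC AUC per class, then macro-average: mAUC. AUC is the probability that
# a random positive is ranked above a random negative, which is why it is
# less sensitive to class imbalance than mAP.
mauc = roc_auc_score(y_true, y_score, average="macro")
print(mauc)  # 0.875: class 0 misranks one positive/negative pair
```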
Top-1 Accuracy
Standard single-label accuracy used for ESC-50, where each clip has exactly one correct class. Unlike mAP, this uses argmax (one prediction per clip). ESC-50 is approaching saturation at 99.1%.
SOTA: 99.1% (OmniVec2, 2024)
Architecture Evolution
Audio classification architectures have evolved through three major phases. CNNs dominated until 2020 with models like PANNs. Vision Transformers took over in 2021 when AST showed that spectrograms could be treated as images. Since 2023, self-supervised methods (BEATs, EAT, SSLAM) have pushed SOTA by pre-training on unlabeled audio.
CNN Era (2017–2020)
PANNs (2020) established the CNN baseline with stacked convolutional blocks and attention pooling. Trained directly on AudioSet from scratch. Still widely deployed in production for its simplicity and inference speed. Peaked at 0.431 mAP.
Transformer Era (2021–2022)
AST (2021) proved pure self-attention works for audio by treating spectrograms as patch sequences. ImageNet pretraining transferred surprisingly well. HTS-AT added hierarchical structure from Swin Transformer. Jump to 0.485 mAP.
SSL Era (2023–present)
Self-supervised learning on 2M+ unlabeled AudioSet clips. BEATs uses discrete tokenization, EAT bootstraps with utterance-frame objectives, and SSLAM trains on audio mixtures for polyphonic robustness. Current SOTA: 0.502 mAP.
ESC-50 Benchmark
ESC-50 contains 2,000 five-second recordings across 50 environmental sound classes (rain, dog barking, clock ticking, etc.). Unlike AudioSet, it is single-label and measured by top-1 accuracy. With OmniVec2 at 99.1%, this benchmark is approaching saturation.
| Rank | Model | Accuracy (%) | Params | Pretraining | Year |
|---|---|---|---|---|---|
| #1 | OmniVec2 TCS Research | 99.1 | 307M | Multimodal (12 modalities) | 2024 |
| #2 | MaskSpec Beijing Academy of AI | 98.2 | 86M | AudioSet self-supervised | 2022 |
| #3 | BEATs Microsoft | 98.1 | 90M | AudioSet-2M | 2023 |
| #4 | SSAST MIT/IBM | 96.8 | 89M | AudioSet + LibriSpeech | 2022 |
| #5 | CLAP LAION | 96.7 | 86M | LAION-Audio-630K | 2023 |
| #6 | SSLAM University of Surrey / Univ. of Edinburgh | 96.2 | 88M | AudioSet-2M self-supervised | 2025 |
| #7 | EAT Chinese Academy of Sciences | 95.9 | 88M | AudioSet-2M self-supervised | 2024 |
| #8 | AST MIT/IBM | 95.6 | 87M | ImageNet + AudioSet | 2021 |
| #9 | BAT University of Surrey | 95.5 | 91M | AudioSet-2M self-supervised | 2026 |
| #10 | PANNs CNN14 ByteDance | 94.7 | 81M | AudioSet | 2020 |
| #11 | | 92.3 | 317M | LibriLight 60k hours | 2020 |
AudioSet Sound Classes
AudioSet's 632 classes are organized in a hierarchical ontology. Here are the top-level categories with example classes and their frequency in the dataset.
Human sounds: 850K+ clips
Music: 620K+
Animals: 180K+
Environment: 240K+
Vehicles: 310K+
Domestic: 190K+
Impacts: 95K+
Mechanical: 120K+

Quick Start Code
Run audio classification inference in a few lines. Here are copy-paste examples for the top models.
import torch
from transformers import AutoFeatureExtractor, ASTForAudioClassification
import torchaudio

# Load model and feature extractor
model = ASTForAudioClassification.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593"
)
extractor = AutoFeatureExtractor.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593"
)
model.eval()

# Load audio (resample to 16 kHz)
waveform, sr = torchaudio.load("audio.wav")
if sr != 16000:
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)

# Inference (the extractor expects a 1D numpy array)
inputs = extractor(waveform.squeeze().numpy(), sampling_rate=16000,
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits.sigmoid()  # multi-label

# Top 5 predictions
top5 = probs[0].topk(5)
for score, idx in zip(top5.values, top5.indices):
    label = model.config.id2label[idx.item()]
    print(f"{label}: {score:.3f}")

import panns_inference
from panns_inference import AudioTagging
import librosa
import numpy as np

# Load pretrained PANNs CNN14
model = AudioTagging(
    checkpoint_path=None,  # auto-downloads
    device='cuda'
)

# Load and preprocess audio (PANNs expects 32 kHz mono)
audio, sr = librosa.load("audio.wav", sr=32000, mono=True)

# Run inference
clipwise_output, embedding = model.inference(audio[None, :])
# clipwise_output shape: (1, 527), sigmoid probs
# embedding shape: (1, 2048), usable for downstream tasks

# Top predictions
labels = panns_inference.labels
top_k = np.argsort(clipwise_output[0])[::-1][:5]
for idx in top_k:
    print(f"{labels[idx]}: {clipwise_output[0][idx]:.3f}")

import torch
import torchaudio
from BEATs import BEATs, BEATsConfig

# Load BEATs checkpoint
checkpoint = torch.load("BEATs_iter3_plus_AS2M.pt")
cfg = BEATsConfig(checkpoint['cfg'])
model = BEATs(cfg)
model.load_state_dict(checkpoint['model'])
model.eval()

# Load audio
waveform, sr = torchaudio.load("audio.wav")
if sr != 16000:
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)

# Classify: for the fine-tuned AS2M checkpoint, extract_features
# returns clip-level probabilities over the 527 classes directly
with torch.no_grad():
    padding_mask = torch.zeros(waveform.shape).bool()
    probs = model.extract_features(
        waveform, padding_mask=padding_mask
    )[0]  # (1, 527)

import torch
import torchaudio
import matplotlib.pyplot as plt

# Load audio
waveform, sr = torchaudio.load("audio.wav")

# Create mel spectrogram transform
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr,
    n_fft=1024,      # FFT window size
    hop_length=320,  # 20 ms hop at 16 kHz
    n_mels=128,      # mel frequency bins
    f_min=50,        # min frequency (Hz)
    f_max=8000,      # max frequency (Hz)
)

# Convert to log-mel (decibels)
mel_spec = mel_transform(waveform)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel_spec)

# Visualize
plt.figure(figsize=(10, 4))
plt.imshow(log_mel[0], aspect='auto', origin='lower',
           cmap='magma')
plt.colorbar(label='dB')
plt.xlabel('Time frames')
plt.ylabel('Mel bins')
plt.title('Log-Mel Spectrogram')
plt.tight_layout()
plt.savefig('mel_spectrogram.png', dpi=150)

AudioSet SOTA Progress
Mean Average Precision on AudioSet has improved from 0.431 to 0.502 over five years — a 16% relative improvement driven primarily by self-supervised pretraining.
Frequently Asked Questions
- What is AudioSet?
- What is mAP in audio classification?
- What is the best audio classification model in 2026?
- How do mel spectrograms work for audio AI?
- What is the difference between AudioSet and ESC-50?
- Why do audio models use Vision Transformers?
Contribute to Audio Classification
Have you achieved better results on AudioSet or ESC-50? Benchmarked a new architecture? Help the community by sharing your verified results.