
Audio Classification

Audio classification identifies what is happening in a sound — music genre, environmental sounds, speaker emotion, language identification — and underpins everything from content moderation to smart home devices. The Audio Spectrogram Transformer (AST) and BEATs brought ImageNet-style transfer learning to audio by treating spectrograms as images, reaching roughly 0.5 mAP on AudioSet's 527-class evaluation set. The paradigm then shifted to audio foundation models such as CLAP (contrastive language-audio pretraining) and Whisper's encoder, which provide general-purpose audio representations that transfer to downstream tasks with minimal fine-tuning. The hard problems that remain are fine-grained classification in noisy real-world conditions, rare sound event detection from few examples, and efficient on-device inference for always-listening applications.
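The spectrogram-as-image idea above can be sketched in a few lines: a log-magnitude STFT turns a 1-D waveform into a 2-D array that a vision-style transformer can patchify. This is a minimal dependency-free sketch; real AST/BEATs pipelines apply a mel filterbank and normalization on top, which are omitted here.

```python
import numpy as np

def log_spectrogram(signal, n_fft=512, hop=256):
    """Log-magnitude STFT spectrogram: the 2-D 'image' that
    spectrogram-based transformers take as input. (Real pipelines
    add a mel filterbank; omitted to stay dependency-free.)"""
    window = np.hanning(n_fft)
    frames = [
        signal[start:start + n_fft] * window
        for start in range(0, len(signal) - n_fft + 1, hop)
    ]
    stft = np.fft.rfft(np.stack(frames), axis=1)  # (frames, freq_bins)
    power = np.abs(stft) ** 2
    return np.log(power + 1e-10).T                # (freq_bins, frames)

# One second of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (257, 61)
```

The resulting array has frequency on one axis and time on the other, so patch embedding and attention apply exactly as they do for images.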

2 datasets · 8 results · canonical metric: mAP

Canonical Benchmark

AudioSet

2M+ human-labeled 10-second YouTube video clips covering 632 audio event classes.

Primary metric: mAP (mean average precision)

Top 10

Leading models on AudioSet.

| Rank | Model  | mAP   | Year | Source |
|------|--------|-------|------|--------|
| 1    | BEATs  | 0.506 | 2023 | paper  |
| 2    | AST    | 0.485 | 2021 | paper  |
| 3    | HTS-AT | 0.471 | 2022 | paper  |
| 4    | CLAP   | 0.428 | 2023 | paper  |

All datasets

2 datasets tracked for this task.

Related tasks

Other tasks in Audio.

Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.
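A minimal sketch of running inference with the transformers audio-classification pipeline. The checkpoint name is an assumption (an AST model fine-tuned on AudioSet, hosted on the Hub), and the example feeds a silent waveform purely for illustration; pass a real file path or array at the model's sample rate in practice.

```python
import numpy as np
from transformers import pipeline

# Assumed checkpoint: an AST model fine-tuned on AudioSet, hosted on the Hub.
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",
)

# The pipeline accepts a file path or a raw waveform array at the
# model's expected sample rate (16 kHz here); silence used as a stand-in.
waveform = np.zeros(16000, dtype=np.float32)
predictions = classifier(waveform, top_k=5)
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")
```

Each prediction is a dict with a `label` (an AudioSet class name) and a `score` (the model's probability for that class).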
