Audio Classification
Audio classification identifies what is happening in a sound (music genre, environmental sounds, speaker emotion, spoken language) and underpins everything from content moderation to smart-home devices. The Audio Spectrogram Transformer (AST) and BEATs brought ImageNet-style transfer learning to audio by treating spectrograms as images, reaching state-of-the-art mean average precision (around 0.5 mAP; AudioSet is evaluated with mAP, not accuracy) on AudioSet's 527-class evaluation set. The paradigm has since shifted toward audio foundation models such as CLAP (contrastive language-audio pretraining) and Whisper's encoder, which provide general-purpose audio representations that transfer to downstream tasks with minimal fine-tuning. The hard problems that remain are fine-grained classification in noisy real-world conditions, rare sound event detection from few examples, and efficient on-device inference for always-listening applications.
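The "spectrograms as images" idea rests on a standard preprocessing step: converting a waveform into a log-mel spectrogram, a 2-D array that models like AST consume the way a vision transformer consumes an image. The sketch below is a minimal NumPy implementation under assumed parameters (16 kHz audio, 400-sample frames, 160-sample hop, 64 mel bands); production systems typically use a tuned library routine such as torchaudio's instead.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK-style mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=64):
    """Frame the signal, take an STFT, and pool FFT bins into mel bands."""
    # Slice the waveform into overlapping Hann-windowed frames
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft//2+1)

    # Triangular mel filterbank: band edges equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)

    mel = power @ fb.T                   # (n_frames, n_mels)
    return np.log(mel + 1e-10).T         # (n_mels, n_frames): an "image"
```

With the defaults above, one second of 16 kHz audio yields a 64 x 98 array, which a spectrogram-based classifier then treats as a single-channel image.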
AudioSet
2M+ human-labeled 10-second sound clips drawn from YouTube videos, annotated against an ontology of 632 audio event classes (527 of which appear in the released labels).
Top 10
Leading models on AudioSet.
All datasets
2 datasets tracked for this task.
Related tasks
Other tasks in Audio.
Looking to run a model? Hugging Face hosts inference for this task type.