Audio Classification
Audio classification identifies what's happening in a sound, whether music genre, environmental sounds, speaker emotion, or acoustic events, and underpins everything from content moderation and environmental monitoring to smart home device triggers. Audio Spectrogram Transformer (AST) and BEATs brought ImageNet-style transfer learning to audio by treating spectrograms as images, reaching roughly 46-51 mAP on AudioSet's 527-class evaluation set (AudioSet is multi-label, so mean average precision rather than accuracy is the standard metric). The paradigm shifted with audio foundation models like CLAP (contrastive language-audio pretraining) and Whisper's encoder, which provide general-purpose audio representations that transfer to downstream tasks with minimal fine-tuning. The hard problems remain fine-grained classification in noisy real-world conditions, rare sound event detection with few examples, and efficient on-device inference for always-listening applications.
History
ESC-50 dataset (Piczak) standardizes environmental sound classification with 50 categories
AudioSet (Google) provides 2M human-labeled 10-second clips spanning 632 sound event categories
PANNs (Kong et al.) achieve strong AudioSet performance with CNN14, establishing a practical baseline
Audio Spectrogram Transformer (AST, Gong et al.) adapts Vision Transformer to audio spectrograms, reaching 45.9 mAP on AudioSet
HTS-AT (Chen et al.) introduces a hierarchical token-semantic audio transformer, pushing AudioSet mAP to 47.1
BEATs (Microsoft) introduces self-supervised pretraining with iteratively refined acoustic tokenizers as prediction targets, achieving 50.6 mAP on AudioSet
CLAP (LAION) aligns audio and text in a shared embedding space, enabling zero-shot audio classification
EAT (Efficient Audio Transformer) and M2D push self-supervised audio pretraining with masked spectrogram modeling
Audio foundation models handle classification as one of many downstream tasks alongside captioning, QA, and retrieval
How Audio Classification Works
Audio preprocessing
Raw audio is converted to log-mel spectrograms (128 mel bins, 25ms windows) — treating audio as a 2D image
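The preprocessing step can be sketched in plain NumPy. The parameter defaults below (25 ms windows, 10 ms hop at 16 kHz, 128 mel bins) are typical choices, not mandated by any particular model:

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=128):
    """Frame the waveform, FFT each frame, then pool the power spectrum
    through a triangular mel filterbank. Illustrative sketch only."""
    # Frame the signal into overlapping windows and taper each frame
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2

    # Triangular mel filterbank: linearly spaced on the mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    mel = power @ fbank.T            # (frames, n_mels)
    return np.log(mel + 1e-6)        # log compression

# 1 second of noise at 16 kHz -> a (time, frequency) "image"
spec = log_mel_spectrogram(np.random.default_rng(0).standard_normal(16000))
print(spec.shape)   # (98, 128)
```

Real pipelines typically call a library routine (e.g. librosa or torchaudio) for this, but the output is the same 2D time-frequency array that downstream transformers treat as an image.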
Patch embedding
The spectrogram is divided into fixed-size patches (e.g., 16x16) and linearly projected to token embeddings
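Patch tokenization is a reshape plus a matrix multiply. Here the random projection matrix stands in for the learned embedding weights, and the shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
spec = rng.standard_normal((96, 128))   # 96 frames x 128 mel bins
P, D = 16, 768                          # patch size, embedding dim

# Cut the spectrogram into a grid of non-overlapping 16x16 patches
t_p, f_p = spec.shape[0] // P, spec.shape[1] // P
patches = (spec[: t_p * P, : f_p * P]
           .reshape(t_p, P, f_p, P)
           .swapaxes(1, 2)
           .reshape(-1, P * P))         # (num_patches, 256)

# Linear projection to token embeddings (random weights as a stand-in)
W = rng.standard_normal((P * P, D)) / np.sqrt(P * P)
tokens = patches @ W
print(tokens.shape)                     # (48, 768)
```

In practice models like AST also add learned positional embeddings to the tokens before the transformer layers.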
Transformer encoding
Self-attention layers process patch tokens, capturing both local frequency patterns and long-range temporal structure
Classification
A [CLS] token or mean-pooled representation is fed to a classification head: independent sigmoid outputs for multi-label tasks, a softmax for single-label tasks
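The two head types behave differently, and the difference matters for multi-event audio. A small contrast with made-up scores:

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])   # scores for 3 hypothetical classes

# Multi-label (AudioSet-style): independent sigmoids, any subset can fire
probs_multi = 1.0 / (1.0 + np.exp(-logits))

# Single-label (ESC-50-style): softmax, probabilities compete and sum to 1
exp = np.exp(logits - logits.max())
probs_single = exp / exp.sum()

print(probs_multi.sum() > 1.0)        # True - labels are independent
print(round(probs_single.sum(), 6))   # 1.0
```

This is why AudioSet results are reported as mAP over per-class sigmoid scores rather than top-1 accuracy.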
Aggregation
For clips longer than the model context, predictions from multiple windows are aggregated (max-pooling or attention-weighted)
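Both aggregation strategies mentioned above reduce to simple pooling over per-window scores. A sketch with invented probabilities for a 2-class problem:

```python
import numpy as np

# Per-window class probabilities for a long clip split into 5 windows
# (values are made up for illustration)
window_probs = np.array([
    [0.1, 0.9],   # window 1: class B dominant
    [0.2, 0.3],
    [0.8, 0.1],   # window 3: class A dominant
    [0.1, 0.2],
    [0.1, 0.1],
])

# Max-pooling: a class counts as present if any window detects it
clip_max = window_probs.max(axis=0)

# Attention-weighted: confident windows get more weight (softmax over
# each window's peak score is one simple heuristic, not a fixed recipe)
attn = np.exp(window_probs.max(axis=1))
attn /= attn.sum()
clip_attn = attn @ window_probs

print(clip_max)   # [0.8 0.9]
```

Max-pooling favors recall for rare, short events; attention-weighted pooling is less sensitive to a single spurious window.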
Current Landscape
Audio classification in 2025 has been transformed by the same self-supervised pretraining revolution that reshaped NLP and vision. Vision Transformer-based architectures (AST, BEATs, EAT) treat spectrograms as images and leverage ImageNet pretraining or masked audio modeling. AudioSet remains the central benchmark, but its noisy labels make progress hard to measure precisely. CLAP has opened up zero-shot classification, analogous to CLIP's impact on vision. Production deployments use lighter CNN models (YAMNet, PANNs) for latency-sensitive applications while transformer models handle quality-critical offline classification.
Key Challenges
Class imbalance in AudioSet: common sounds (speech, music) have 100x more examples than rare events (gunshots, glass breaking)
Noisy labels: AudioSet annotations are crowd-sourced and contain ~15-20% label noise, capping effective model accuracy
Real-world audio contains overlapping events — a single clip may have speech, music, and traffic simultaneously
Domain shift between AudioSet (YouTube clips) and deployment environments (surveillance, IoT, medical)
Temporal resolution: classifying when events occur within a clip (sound event detection) is harder than clip-level classification
Quick Recommendations
Best accuracy (AudioSet)
BEATs or EAT
50+ mAP on AudioSet; self-supervised pretraining captures rich audio representations
Zero-shot audio classification
CLAP (LAION) or Whisper-AT
Classify audio with arbitrary text descriptions without task-specific training
Production (lightweight)
PANNs CNN14 or YAMNet
Efficient CNN-based classifiers that run in real-time on edge devices
Environmental sound monitoring
AST fine-tuned on ESC-50 or UrbanSound8K
95%+ accuracy on ESC-50, with the strongest pretrained models approaching 98%
Music classification
MERT or MusicNN
Specialized for music genre, mood, and instrument recognition tasks
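The zero-shot route recommended above (CLAP-style) reduces to cosine similarity between one audio embedding and several text embeddings in a shared space. The random vectors below are stand-ins for real encoder outputs, and the temperature value is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
audio_emb = rng.standard_normal(512)          # stand-in audio encoder output
text_embs = rng.standard_normal((3, 512))     # stand-ins for label prompts,
                                              # e.g. "dog barking", "rain", "siren"

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between the clip and each candidate label prompt,
# turned into a distribution with a temperature-scaled softmax
sims = normalize(text_embs) @ normalize(audio_emb)
scaled = np.exp(sims / 0.07)
probs = scaled / scaled.sum()
pred = int(probs.argmax())
```

Swapping the label set requires only re-encoding the text prompts, which is what makes this approach "zero-shot": no classifier head is trained.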
What's Next
Expect audio classification to merge into general audio understanding — a single model that classifies, captions, answers questions about, and retrieves audio. Fine-grained temporal event detection (not just 'this clip contains a dog bark' but 'a dog barks at 2.3 seconds for 0.5 seconds') will improve through frame-level models. On-device classification for smart home, wearable, and IoT applications will drive efficient architectures under 5M parameters.
Benchmarks & SOTA
AudioSet
AudioSet
2M+ human-labeled 10-second YouTube video clips covering 632 audio event classes.
No results tracked yet
ESC-50
Environmental Sound Classification 50
2,000 environmental audio recordings organized into 50 classes (animals, natural soundscapes, etc.).
No results tracked yet
Related Tasks
Audio Captioning
Generating text descriptions of audio content.
Music Generation
Generating music from text, audio, or other inputs.
Sound Event Detection
Detecting and localizing sound events in audio.
Text-to-Audio
Generating sound effects, music, and ambient audio from natural language descriptions.