Audio Classification
Audio classification identifies what's happening in a sound, whether music genre, environmental sounds, speaker emotion, or acoustic events, and underpins everything from content moderation and environmental monitoring to smart home device triggers. Audio Spectrogram Transformer (AST) and BEATs brought ImageNet-style transfer learning to audio by treating spectrograms as images, reaching roughly 46-51 mAP on AudioSet's 527-class evaluation set (AudioSet is multi-label, so mean average precision rather than accuracy is the standard metric). The paradigm shifted with audio foundation models like CLAP (contrastive language-audio pretraining) and Whisper's encoder, which provide general-purpose audio representations that transfer to downstream tasks with minimal fine-tuning. The hard problems remain fine-grained classification in noisy real-world conditions, rare sound event detection with few examples, and efficient on-device inference for always-listening applications.
History
ESC-50 dataset (Piczak) standardizes environmental sound classification with 50 categories
AudioSet (Google) provides 2M human-labeled 10-second clips spanning 632 sound event categories
PANNs (Kong et al.) achieve strong AudioSet performance with CNN14, establishing a practical baseline
Audio Spectrogram Transformer (AST, Gong et al.) adapts Vision Transformer to audio spectrograms, reaching 45.9 mAP on AudioSet
HTS-AT (Chen et al.) introduces a hierarchical token-semantic audio transformer, pushing AudioSet mAP to 47.1
BEATs (Microsoft) introduces self-supervised pretraining with iteratively refined acoustic tokenizers as prediction targets, achieving 50.6 mAP on AudioSet
CLAP (LAION) aligns audio and text in a shared embedding space, enabling zero-shot audio classification
EAT (Efficient Audio Transformer) and M2D push self-supervised audio pretraining with masked spectrogram modeling
Audio foundation models handle classification as one of many downstream tasks alongside captioning, QA, and retrieval
How Audio Classification Works
Audio preprocessing
Raw audio is converted to log-mel spectrograms (128 mel bins, 25ms windows) — treating audio as a 2D image
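The preprocessing step can be sketched in plain NumPy. The parameter defaults below (25 ms windows, 10 ms hop at 16 kHz, 128 mel bins) are typical choices, not mandated by any particular model:

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=128):
    """Frame the waveform, FFT each frame, then pool the power spectrum
    through a triangular mel filterbank. Illustrative sketch only."""
    # Frame the signal into overlapping windows and taper each frame
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2

    # Triangular mel filterbank: linearly spaced on the mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    mel = power @ fbank.T            # (frames, n_mels)
    return np.log(mel + 1e-6)        # log compression

# 1 second of noise at 16 kHz -> a (time, frequency) "image"
spec = log_mel_spectrogram(np.random.default_rng(0).standard_normal(16000))
print(spec.shape)   # (98, 128)
```

Real pipelines typically call a library routine (e.g. librosa or torchaudio) for this, but the output is the same 2D time-frequency array that downstream transformers treat as an image.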
Patch embedding
The spectrogram is divided into fixed-size patches (e.g., 16x16) and linearly projected to token embeddings
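Patch tokenization is a reshape plus a matrix multiply. Here the random projection matrix stands in for the learned embedding weights, and the shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
spec = rng.standard_normal((96, 128))   # 96 frames x 128 mel bins
P, D = 16, 768                          # patch size, embedding dim

# Cut the spectrogram into a grid of non-overlapping 16x16 patches
t_p, f_p = spec.shape[0] // P, spec.shape[1] // P
patches = (spec[: t_p * P, : f_p * P]
           .reshape(t_p, P, f_p, P)
           .swapaxes(1, 2)
           .reshape(-1, P * P))         # (num_patches, 256)

# Linear projection to token embeddings (random weights as a stand-in)
W = rng.standard_normal((P * P, D)) / np.sqrt(P * P)
tokens = patches @ W
print(tokens.shape)                     # (48, 768)
```

In practice models like AST also add learned positional embeddings to the tokens before the transformer layers.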
Transformer encoding
Self-attention layers process patch tokens, capturing both local frequency patterns and long-range temporal structure
Classification
A [CLS] token or mean-pooled representation is fed to a classification head: independent sigmoid outputs for multi-label tasks, a softmax for single-label tasks
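The two head types behave differently, and the difference matters for multi-event audio. A small contrast with made-up scores:

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])   # scores for 3 hypothetical classes

# Multi-label (AudioSet-style): independent sigmoids, any subset can fire
probs_multi = 1.0 / (1.0 + np.exp(-logits))

# Single-label (ESC-50-style): softmax, probabilities compete and sum to 1
exp = np.exp(logits - logits.max())
probs_single = exp / exp.sum()

print(probs_multi.sum() > 1.0)        # True - labels are independent
print(round(probs_single.sum(), 6))   # 1.0
```

This is why AudioSet results are reported as mAP over per-class sigmoid scores rather than top-1 accuracy.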
Aggregation
For clips longer than the model context, predictions from multiple windows are aggregated (max-pooling or attention-weighted)
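Both aggregation strategies mentioned above reduce to simple pooling over per-window scores. A sketch with invented probabilities for a 2-class problem:

```python
import numpy as np

# Per-window class probabilities for a long clip split into 5 windows
# (values are made up for illustration)
window_probs = np.array([
    [0.1, 0.9],   # window 1: class B dominant
    [0.2, 0.3],
    [0.8, 0.1],   # window 3: class A dominant
    [0.1, 0.2],
    [0.1, 0.1],
])

# Max-pooling: a class counts as present if any window detects it
clip_max = window_probs.max(axis=0)

# Attention-weighted: confident windows get more weight (softmax over
# each window's peak score is one simple heuristic, not a fixed recipe)
attn = np.exp(window_probs.max(axis=1))
attn /= attn.sum()
clip_attn = attn @ window_probs

print(clip_max)   # [0.8 0.9]
```

Max-pooling favors recall for rare, short events; attention-weighted pooling is less sensitive to a single spurious window.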
Current Landscape
Audio classification in 2025 has been transformed by the same self-supervised pretraining revolution that reshaped NLP and vision. Vision Transformer-based architectures (AST, BEATs, EAT) treat spectrograms as images and leverage ImageNet pretraining or masked audio modeling. AudioSet remains the central benchmark, but its noisy labels make progress hard to measure precisely. CLAP has opened up zero-shot classification, analogous to CLIP's impact on vision. Production deployments use lighter CNN models (YAMNet, PANNs) for latency-sensitive applications while transformer models handle quality-critical offline classification.
Key Challenges
Class imbalance in AudioSet: common sounds (speech, music) have 100x more examples than rare events (gunshots, glass breaking)
Noisy labels: AudioSet annotations are crowd-sourced and contain ~15-20% label noise, capping effective model accuracy
Real-world audio contains overlapping events — a single clip may have speech, music, and traffic simultaneously
Domain shift between AudioSet (YouTube clips) and deployment environments (surveillance, IoT, medical)
Temporal resolution: classifying when events occur within a clip (sound event detection) is harder than clip-level classification
Quick Recommendations
Best accuracy (AudioSet)
BEATs or EAT
50+ mAP on AudioSet; self-supervised pretraining captures rich audio representations
Zero-shot audio classification
CLAP (LAION) or Whisper-AT
Classify audio with arbitrary text descriptions without task-specific training
Production (lightweight)
PANNs CNN14 or YAMNet
Efficient CNN-based classifiers that run in real-time on edge devices
Environmental sound monitoring
AST fine-tuned on ESC-50 or UrbanSound8K
95%+ accuracy on ESC-50, with the strongest pretrained models approaching 98%
Music classification
MERT or MusicNN
Specialized for music genre, mood, and instrument recognition tasks
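The zero-shot route recommended above (CLAP-style) reduces to cosine similarity between one audio embedding and several text embeddings in a shared space. The random vectors below are stand-ins for real encoder outputs, and the temperature value is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
audio_emb = rng.standard_normal(512)          # stand-in audio encoder output
text_embs = rng.standard_normal((3, 512))     # stand-ins for label prompts,
                                              # e.g. "dog barking", "rain", "siren"

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between the clip and each candidate label prompt,
# turned into a distribution with a temperature-scaled softmax
sims = normalize(text_embs) @ normalize(audio_emb)
scaled = np.exp(sims / 0.07)
probs = scaled / scaled.sum()
pred = int(probs.argmax())
```

Swapping the label set requires only re-encoding the text prompts, which is what makes this approach "zero-shot": no classifier head is trained.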
What's Next
Expect audio classification to merge into general audio understanding — a single model that classifies, captions, answers questions about, and retrieves audio. Fine-grained temporal event detection (not just 'this clip contains a dog bark' but 'a dog barks at 2.3 seconds for 0.5 seconds') will improve through frame-level models. On-device classification for smart home, wearable, and IoT applications will drive efficient architectures under 5M parameters.
Benchmarks & SOTA
AudioSet
AudioSet
2M+ human-labeled 10-second YouTube video clips covering 632 audio event classes.
No results tracked yet
ESC-50
Environmental Sound Classification 50
2,000 environmental audio recordings organized into 50 classes (animals, natural soundscapes, etc.).
No results tracked yet
Related Tasks
Audio Captioning
Generating text descriptions of audio content.
Music Generation
Generating music from text, audio, or other inputs.
Sound Event Detection
Detecting and localizing sound events in audio.
Text-to-Audio
Generating sound effects, music, and ambient audio from natural language descriptions.