Audio Classification
Classify audio into categories like music genres, environmental sounds, speaker emotions, or speech commands.
How Audio Classification Works
A technical deep-dive into Audio Classification. From mel spectrograms and MFCCs to transformers and zero-shot learning with CLAP.
What is Audio Classification?
Audio classification assigns labels to audio segments. The model listens to a clip and outputs what sounds are present. This powers everything from voice assistants to content moderation.
The Classification Pipeline
Common Audio Classification Tasks
- Voice assistants: wake word detection ("Hey Siri"), command recognition, and speaker verification for personalization.
- Content moderation: detecting harmful audio content, gunshots in videos, or inappropriate speech in live streams.
- Smart home and security: detecting baby crying, glass breaking, smoke alarms, or unusual sounds for home security systems.
Single-Label vs Multi-Label Classification
- Single-label: one label per audio clip. Uses softmax and cross-entropy loss. Example: "jazz.mp3" -> Jazz
- Multi-label: multiple labels per clip. Uses sigmoid and binary cross-entropy loss. Example: "street.wav" -> Traffic, Horn, Music (see the sketch below)
Audio Preprocessing
Raw audio is a 1D waveform of amplitude values over time. Models need structured features. The key insight: spectrograms turn audio into images, unlocking powerful vision techniques.
The Mel Scale: Why It Matters
Humans perceive pitch logarithmically: the step from 100 Hz to 200 Hz sounds about as large as the step from 1000 Hz to 2000 Hz (both are one octave). The mel scale compresses frequencies to match this perception (see the conversion sketch after the table).
| Frequency (Hz) | Mel | Note |
|---|---|---|
| 0 Hz | 0 mel | Scale origin |
| 1000 Hz | 1000 mel | Roughly linear below 1 kHz |
| 2000 Hz | ~1500 mel | Logarithmic above |
| 4000 Hz | ~2100 mel | Matches perception |
| 8000 Hz | ~2840 mel | High frequencies compressed |
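The values above approximately follow the common HTK formula mel(f) = 2595 * log10(1 + f/700); some libraries use the slightly different Slaney variant, so treat the exact numbers as approximate. A quick way to check them:

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

for f in [0, 1000, 2000, 4000, 8000]:
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
# Doubling the frequency above ~1 kHz adds progressively fewer mels
# (logarithmic compression), matching the table above.
```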
Feature Extraction Methods
- Mel spectrogram: time-frequency representation on the mel scale (matches human hearing)
- MFCCs (Mel-Frequency Cepstral Coefficients): compact spectral features derived from the mel spectrum
- Raw waveform: direct audio samples fed to the model
- Log-mel spectrogram: log-scaled mel spectrogram (used by AST, Whisper)
How a Spectrogram is Created
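The typical recipe is a short-time Fourier transform, followed by a mel filterbank, followed by log compression. A minimal sketch with librosa; the file path is a placeholder and the frame parameters are common choices rather than a fixed standard:

```python
import librosa
import numpy as np

# Load audio, resampled to 16 kHz mono (a common choice for classification models)
y, sr = librosa.load("path/to/audio.wav", sr=16000)

# Short-time Fourier transform -> mel filterbank -> log compression
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=400,          # 25 ms window at 16 kHz
    hop_length=160,     # 10 ms hop
    n_mels=128,         # number of mel bands
)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)    # (128 mel bands, num_frames): an "image" a model can classify
```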
Model Architectures
Four main approaches dominate audio classification: CNNs on spectrograms, Transformers on patches, raw waveform models, and contrastive audio-text models.
CNN on Spectrogram (VGGish, YAMNet)
The key insight: a spectrogram is just an image. We can use proven image classification architectures directly.
- Input: a mel spectrogram, e.g., 128 mel bands x 128 time frames x 1 channel
- Stacked convolution and pooling layers reduce dimensionality
- Output: softmax (single-label) or sigmoid (multi-label) over the classes (see the sketch below)
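A minimal PyTorch sketch of this pattern; the layer sizes and class count are illustrative, not the actual VGGish/YAMNet configuration:

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Minimal CNN classifier over a (1, 128, 128) log-mel spectrogram."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # global pooling -> fixed-size vector
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):                      # x: (batch, 1, n_mels, time)
        h = self.features(x).flatten(1)
        return self.classifier(h)              # logits; softmax or sigmoid applied in the loss

model = SpectrogramCNN(num_classes=10)
logits = model(torch.randn(2, 1, 128, 128))    # fake batch of spectrograms
print(logits.shape)                            # torch.Size([2, 10])
```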
Audio Spectrogram Transformer (AST)
AST treats the spectrogram like ViT treats images: split into patches, add positional embeddings, and feed to a transformer encoder.
AST achieves 0.459 mAP on AudioSet, a significant improvement over CNNs. The self-attention mechanism captures long-range temporal dependencies.
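A sketch of the patch-embedding idea. AST actually uses overlapping 16x16 patches (stride 10), a class token, and 12 encoder layers; this simplified version uses non-overlapping patches and a smaller encoder:

```python
import torch
import torch.nn as nn

# A 128-band log-mel spectrogram with 1024 time frames, as a single-channel "image"
spec = torch.randn(1, 1, 128, 1024)

# Patch embedding: a strided convolution cuts the spectrogram into 16x16 patches
# and projects each patch to a 768-dim token
patch_embed = nn.Conv2d(1, 768, kernel_size=16, stride=16)
tokens = patch_embed(spec)                  # (1, 768, 8, 64)
tokens = tokens.flatten(2).transpose(1, 2)  # (1, 512, 768): 512 patch tokens

# Add learned positional embeddings, then run a transformer encoder over the sequence
pos = nn.Parameter(torch.zeros(1, tokens.shape[1], 768))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,   # AST uses 12 layers; kept small here
)
out = encoder(tokens + pos)
print(out.shape)    # (1, 512, 768): pooled and fed to a classification head
```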
CLAP: Zero-Shot Audio Classification
CLAP (Contrastive Language-Audio Pretraining) learns a shared embedding space for audio and text. This enables zero-shot classification using natural language descriptions (see the example below).
1. Encode the audio clip into an embedding vector
2. Encode the text labels into embedding vectors
3. Compute the cosine similarity between the audio embedding and each label embedding
4. The label with the highest similarity is the predicted class
- No training data needed for new classes
- Classes can be any text description
- Great for prototyping and rare events
- Combine with few-shot learning for best results
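A minimal zero-shot example using the Transformers zero-shot-audio-classification pipeline; the checkpoint name is one publicly available CLAP model, and the labels and file path are placeholders:

```python
from transformers import pipeline

# Zero-shot audio classification with a CLAP checkpoint
classifier = pipeline(
    "zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",
)

candidate_labels = ["dog barking", "glass breaking", "baby crying", "vacuum cleaner"]
result = classifier("path/to/audio.wav", candidate_labels=candidate_labels)

for pred in result:
    print(f"{pred['label']:20} {pred['score']:.3f}")
```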
Models Comparison
From lightweight YAMNet to state-of-the-art BEATs. Choose based on your accuracy, speed, and flexibility requirements.
| Model | Type | Architecture | Speed | AudioSet | Notes |
|---|---|---|---|---|---|
| YAMNet | CNN | MobileNetV1 on mel spectrogram | Very Fast | mAP: 0.306 | 521 AudioSet classes, lightweight, production-ready |
| AST | Transformer | Audio Spectrogram Transformer (ViT-based) | Medium | mAP: 0.459 | SOTA on AudioSet, patches spectrograms like images |
| Wav2Vec2 | Transformer | Self-supervised on raw waveform | Slow | Varies | Pre-trained on 60K hours, fine-tune for classification |
| CLAP | Multimodal | Audio + Text encoder (contrastive) | Medium | Zero-shot capable | Zero-shot classification via text prompts, very flexible |
| Whisper | Transformer | Encoder-decoder for ASR | Slow | Speech-focused | Encoder can be fine-tuned for classification tasks |
| BEATs | Transformer | Audio pre-training with acoustic tokenizers | Medium | mAP: 0.505 | Current SOTA on AudioSet, iterative self-distillation |
Use YAMNet when (usage sketch below):
- You need fast, lightweight inference
- The 521 AudioSet classes are sufficient
- You are running on mobile or edge devices
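A minimal YAMNet usage sketch via TensorFlow Hub; it assumes the tensorflow and tensorflow_hub packages, and the random waveform stands in for real 16 kHz audio:

```python
import numpy as np
import tensorflow_hub as hub

# Load YAMNet from TensorFlow Hub
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

# YAMNet expects a mono float32 waveform at 16 kHz in [-1, 1]
waveform = np.random.uniform(-1, 1, 16000).astype(np.float32)  # 1 second of noise as a stand-in

scores, embeddings, spectrogram = yamnet(waveform)
print(scores.shape)       # (num_frames, 521): per-frame scores over the 521 AudioSet classes
print(embeddings.shape)   # (num_frames, 1024): reusable audio embeddings
```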
Use AST or BEATs when:
- Maximum accuracy is critical
- GPU inference is available
- You are fine-tuning on custom data
Use Wav2Vec2 when:
- The task is speech-focused (emotion, speaker recognition)
- You work with low-resource languages
- Pre-trained representations matter
Use CLAP when:
- You need zero-shot classification
- Classes change frequently
- Classes are natural language descriptions
Model Sizes (Approximate)
Benchmarks and Datasets
Standard datasets for evaluating audio classification. mAP for multi-label, accuracy for single-label.
| Dataset | Domain | Classes | Size | Metric | SOTA |
|---|---|---|---|---|---|
| AudioSet | General Audio | 527 | ~2M clips | mAP | 0.505 (BEATs) |
| ESC-50 | Environmental | 50 | 2,000 clips | Accuracy | 97.0% (BEATs) |
| Speech Commands | Keywords | 35 | 105K clips | Accuracy | 98.7% (KWT) |
| FSD50K | Freesound | 200 | 51K clips | mAP | 0.590 (PaSST) |
| VoxCeleb1 | Speaker ID | 1251 | 153K utterances | EER | 0.87% (ECAPA-TDNN) |
| GTZAN | Music Genre | 10 | 1,000 clips | Accuracy | 93.9% |
Understanding the Metrics
mAP (mean average precision) averages the per-class average precision and is the standard metric for multi-label benchmarks such as AudioSet and FSD50K. Accuracy is used for single-label datasets (ESC-50, Speech Commands, GTZAN). EER (equal error rate) is used for verification tasks such as VoxCeleb1; lower is better.
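For multi-label mAP, one common route is scikit-learn's average_precision_score with macro averaging. A toy sketch; the labels and scores below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy multi-label setup: 4 clips, 3 classes; y_true is 0/1, y_score are model probabilities
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.8, 0.1, 0.6],
                    [0.2, 0.7, 0.3],
                    [0.9, 0.6, 0.2],
                    [0.1, 0.4, 0.7]])

# mAP = mean of per-class average precision (the AudioSet / FSD50K metric)
map_score = average_precision_score(y_true, y_score, average="macro")
print(f"mAP: {map_score:.3f}")
```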
AudioSet: The ImageNet of Audio
AudioSet is a massive dataset of 10-second YouTube clips annotated with 527 sound classes. It is the de facto benchmark for general audio classification, similar to ImageNet for vision.
ESC-50: Environmental Sound Classification
A smaller, cleaner benchmark with 50 classes of environmental sounds: 2,000 clips of 5 seconds each, organized into 5 predefined folds. Perfect for quick experiments.
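A sketch of the standard ESC-50 protocol: 5-fold cross-validation over the predefined folds, reporting mean accuracy. The local dataset path is an assumption, and the training/evaluation code is left to you:

```python
import pandas as pd

# ESC-50 ships its labels and predefined folds in meta/esc50.csv;
# the path below assumes you unpacked the dataset into ESC-50-master/
meta = pd.read_csv("ESC-50-master/meta/esc50.csv")

for fold in range(1, 6):
    train = meta[meta["fold"] != fold]
    test = meta[meta["fold"] == fold]
    print(f"fold {fold}: {len(train)} train / {len(test)} test clips")
    # train on train["filename"] / train["category"], evaluate on the held-out fold,
    # then report the mean accuracy across the five folds
```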
Code Examples
Get started with audio classification in Python. From quick pipelines to zero-shot CLAP.
```python
from transformers import pipeline

# Load pre-trained audio classification pipeline
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593"
)

# Classify an audio file
result = classifier("path/to/audio.wav")
for pred in result[:5]:
    print(f"{pred['label']:30} {pred['score']:.3f}")

# Output:
# Speech                         0.892
# Music                          0.067
# Silence                        0.023
```

Quick Reference
- YAMNet for speed (3.7M params)
- AST for accuracy (mAP 0.459)
- 16kHz sampling, mel spectrograms
- CLAP for zero-shot
- Transformers pipeline
- ESC-50 for quick eval
- Spectrograms turn audio into images
- Mel scale matches human perception
- Transformers now beat CNNs
Use Cases
- ✓ Music genre classification
- ✓ Environmental sound detection
- ✓ Emotion recognition
- ✓ Voice command detection
- ✓ Audio content moderation
Architectural Patterns
Spectrogram + CNN
Convert audio to mel spectrogram, classify with CNN.
- Pros: well-understood, fast inference, works well for many tasks
- Cons: fixed input length, pooling discards fine temporal detail
Audio Transformers
Use self-attention on audio features (AST, HuBERT).
- Pros: state-of-the-art accuracy, captures long-range dependencies
- Cons: larger models, more compute
Self-Supervised Pre-training
Pre-train on unlabeled audio, then fine-tune for the target task (sketched below).
- Pros: less labeled data needed, better generalization
- Cons: large pre-training cost, complex pipeline
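A minimal sketch of this pattern with the Transformers library: start from a self-supervised checkpoint and attach a fresh classification head. The checkpoint name and label count are placeholder choices:

```python
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_name = "facebook/wav2vec2-base"          # self-supervised checkpoint (placeholder choice)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name, num_labels=10)

# The feature extractor turns a raw 16 kHz waveform into model inputs
waveform = torch.randn(16000).numpy()          # 1 second of fake audio as a stand-in
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

logits = model(**inputs).logits                # (1, 10); fine-tune with Trainer or a PyTorch loop
print(logits.shape)
```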
Implementations
Open Source
Audio Spectrogram Transformer (AST)
BSD 3-Clause. State-of-the-art on AudioSet. Pure transformer.
Quick Facts
- Input: Audio
- Output: Structured Data
- Implementations: 5 open source, 0 API
- Patterns: 3 approaches