
Audio Classification

Classify audio into categories like music genres, environmental sounds, speaker emotions, or speech commands.

How Audio Classification Works

A technical deep-dive into Audio Classification. From mel spectrograms and MFCCs to transformers and zero-shot learning with CLAP.

1. What is Audio Classification?

Audio classification assigns labels to audio segments. The model listens to a clip and outputs what sounds are present. This powers everything from voice assistants to content moderation.

The Classification Pipeline

Load Audio (read file, resample to 16 kHz mono) -> Preprocess (extract features, e.g. mel spectrogram) -> Model (neural network, e.g. AST or a CNN) -> Logits (raw predictions, e.g. 527 AudioSet scores) -> Output (final labels, e.g. Dog, Bark)
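
A minimal sketch of these stages with the Hugging Face transformers library, assuming the AudioSet-finetuned AST checkpoint used later on this page and a placeholder file path:

import librosa
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_id = "MIT/ast-finetuned-audioset-10-10-0.4593"  # AST fine-tuned on AudioSet

# 1. Load audio: read the file and resample to 16 kHz mono
waveform, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)

# 2. Preprocess: compute the log-mel features the model expects
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
inputs = feature_extractor(waveform, sampling_rate=sr, return_tensors="pt")

# 3. Model: forward pass through the transformer
model = AutoModelForAudioClassification.from_pretrained(model_id)
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 527): one raw score per AudioSet class

# 4-5. Logits -> labels: AudioSet is multi-label, so apply sigmoid and keep the top scores
scores = torch.sigmoid(logits)[0]
top = torch.topk(scores, k=5)
for score, idx in zip(top.values, top.indices):
    print(f"{model.config.id2label[idx.item()]:30} {score.item():.3f}")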

Common Audio Classification Tasks

  • Music Genre: classify music by genre (Rock, Jazz, Classical)
  • Environmental Sound: identify ambient sounds (Rain, Traffic, Siren)
  • Speech Commands: recognize spoken keywords (Yes, No, Stop, Go)
  • Emotion Detection: detect emotion in speech (Happy, Sad, Angry)
  • Speaker ID: identify who is speaking (Person A, Person B)
  • Language ID: detect the spoken language (English, Spanish, French)
  • Acoustic Events: detect specific sounds (Gunshot, Glass Break)
  • Audio Tagging: multi-label classification (Dog + Bark + Outdoor)

Voice Assistants

Wake word detection ("Hey Siri"), command recognition, and speaker verification for personalization.

Content Moderation

Detect harmful audio content, gunshots in videos, or inappropriate speech in live streams.

Smart Environments

Detect baby crying, glass breaking, smoke alarms, or unusual sounds for home security systems.

Single-Label vs Multi-Label Classification

Single-Label

One label per audio clip. Uses a softmax output and cross-entropy loss.

  "dog.wav" -> Dog Barking
  "jazz.mp3" -> Jazz

Multi-Label

Multiple labels per clip. Uses sigmoid outputs and binary cross-entropy loss.

  "park.wav" -> Dog, Birds, Wind, Speech
  "street.wav" -> Traffic, Horn, Music
2. Audio Preprocessing

Raw audio is a 1D waveform of amplitude values over time. Models need structured features. The key insight: spectrograms turn audio into images, unlocking powerful vision techniques.

The Mel Scale: Why It Matters

Humans perceive pitch logarithmically: the jump from 100 Hz to 200 Hz sounds like the same interval as the jump from 1000 Hz to 2000 Hz. The mel scale compresses frequencies to match this perception.

Frequency to Mel Mapping
Hz        Mel          Note
0 Hz      0 mel        Silence
1000 Hz   1000 mel     Roughly linear below 1 kHz
2000 Hz   ~1520 mel    Logarithmic above 1 kHz
4000 Hz   ~2150 mel    Matches perception
8000 Hz   ~2840 mel    High frequencies compressed
Formula:
mel(f) = 2595 * log10(1 + f/700)
This is why mel spectrograms work so well - they represent audio the way we hear it.
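
A quick sanity check of the table above, evaluating the formula directly:

import math

def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel scale: mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

for f in [0, 1000, 2000, 4000, 8000]:
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
# 0 -> 0.0, 1000 -> ~1000, 2000 -> ~1521, 4000 -> ~2146, 8000 -> ~2840 mel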

Feature Extraction Methods

1. Mel Spectrogram
   Time-frequency representation using the mel scale (matches human hearing)
   Shape: [Time x Mel Bins], e.g. 128 x 128
   + Works great with CNNs, captures timbral info
   - Loses phase information

2. MFCCs
   Mel-Frequency Cepstral Coefficients: compact spectral features
   Shape: [Time x N_MFCC], e.g. 128 x 13
   + Very compact (13-40 coefficients), traditional ASR standard
   - Loses some information, less interpretable

3. Raw Waveform
   Direct audio samples as input to the model
   Shape: [Samples], e.g. 16000 for 1 second at 16 kHz
   + No information loss, model learns its own features
   - Needs more data, longer sequences

4. Log-Mel Filterbank
   Log-scaled mel spectrogram (used by AST, Whisper)
   Shape: [Time x 80/128], e.g. 100 x 80
   + Better dynamic range, standard for transformers
   - Requires careful normalization

How a Spectrogram is Created

Waveform (amplitude vs time) -> STFT (sliding-window FFT) -> Mel filterbank (128 triangular filters) -> Log scale (compress dynamics) -> Log-mel spectrogram

Result: a 2D image where X = time, Y = frequency, and color = amplitude (dB)

Sampling Rate: Most models expect 16kHz (speech) or 22.05kHz (music). AudioSet models use 16kHz. Always resample to match the model's training data. Higher sample rates capture more high-frequency detail but require more compute.
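
A minimal librosa sketch of this chain, assuming a placeholder WAV file and the common 16 kHz / 128-mel configuration:

import librosa
import numpy as np

# Load and resample to 16 kHz mono (match the model's training sample rate)
y, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)

# Waveform -> STFT -> mel filterbank (128 triangular filters)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=160, n_mels=128)

# Log scale to compress the dynamic range (dB)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (128 mel bins, n_frames); transpose for [Time x Mel Bins]

# MFCCs: a further-compressed representation (13 coefficients is the classic choice)
mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)
print(mfcc.shape)     # (13, n_frames)
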
3. Model Architectures

Four main approaches dominate audio classification: CNNs on spectrograms, Transformers on patches, raw waveform models, and contrastive audio-text models.

1. CNN on Spectrogram: treat audio as an image
   Examples: VGGish, YAMNet
   Audio -> Mel Spec -> 2D CNN -> FC -> Classes

2. Transformer on Patches: split the spectrogram into patches
   Examples: AST, HTS-AT
   Audio -> Mel Spec -> Patches -> Transformer -> Classes

3. Raw Waveform: learn from raw samples
   Examples: Wav2Vec2, HuBERT
   Audio -> Conv Encoder -> Transformer -> Classes

4. Contrastive Learning: align audio with text
   Examples: CLAP, AudioCLIP
   Audio -> Encoder | Text -> Encoder -> Similarity

CNN on Spectrogram (VGGish, YAMNet)

The key insight: a spectrogram is just an image. We can use proven image classification architectures directly.

Input: mel spectrogram treated as a 2D image, e.g. (128 mel x 128 time x 1 channel)
Architecture: convolutional layers extract local patterns; pooling reduces dimensionality
Output: global pooling + fully connected layer; softmax or sigmoid over the classes
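
A deliberately tiny PyTorch sketch of the pattern (not VGGish or YAMNet themselves, just spectrogram-as-image classification):

import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Tiny CNN that classifies a (1, 128, 128) log-mel spectrogram 'image'."""
    def __init__(self, num_classes: int = 50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                   # global pooling -> (B, 64, 1, 1)
        )
        self.classifier = nn.Linear(64, num_classes)   # FC layer -> class logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

logits = SpectrogramCNN()(torch.randn(2, 1, 128, 128))  # batch of 2 spectrogram "images"
print(logits.shape)  # torch.Size([2, 50])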

Audio Spectrogram Transformer (AST)

AST treats the spectrogram like ViT treats images: split into patches, add positional embeddings, and feed to a transformer encoder.

Mel Spec (128 x 1024) -> Patches (16x16) -> Linear Projection (768-dim embeddings) -> Transformer (12 layers) -> [CLS] token -> Classes

AST achieves 0.459 mAP on AudioSet, a significant improvement over CNNs. The self-attention mechanism captures long-range temporal dependencies.
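
A simplified patchify step in PyTorch, using non-overlapping patches for clarity (AST itself uses overlapping patches with a smaller stride):

import torch

spec = torch.randn(1, 1, 128, 1024)                  # (batch, channel, mel bins, time frames)

# Cut the spectrogram into non-overlapping 16x16 patches and flatten each patch
patches = spec.unfold(2, 16, 16).unfold(3, 16, 16)   # (1, 1, 8, 64, 16, 16)
patches = patches.reshape(1, -1, 16 * 16)            # (1, 512 patches, 256 values each)

# Each patch becomes a 768-dim token; the token sequence goes into the transformer
proj = torch.nn.Linear(16 * 16, 768)
tokens = proj(patches)
print(tokens.shape)                                  # torch.Size([1, 512, 768])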

CLAP: Zero-Shot Audio Classification

CLAP (Contrastive Language-Audio Pretraining) learns a shared embedding space for audio and text. This enables zero-shot classification using natural language descriptions.

How it works:
  1. Encode the audio clip into an embedding vector
  2. Encode the text labels into embedding vectors
  3. Compute cosine similarity between the audio embedding and each label embedding
  4. Highest similarity = predicted class

Why it matters:
  • No training data needed for new classes
  • Classes can be any text description
  • Great for prototyping and rare events
  • Combine with few-shot learning for best results
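
A sketch of these four steps with the transformers CLAP classes, assuming the laion/clap-htsat-unfused checkpoint (which expects 48 kHz audio); other CLAP variants work the same way:

import librosa
import torch
from transformers import ClapModel, ClapProcessor

model_id = "laion/clap-htsat-unfused"   # assumed checkpoint
model = ClapModel.from_pretrained(model_id)
processor = ClapProcessor.from_pretrained(model_id)

labels = ["a dog barking", "jazz music", "rain falling on a roof", "a baby crying"]
audio, sr = librosa.load("path/to/audio.wav", sr=48000, mono=True)  # CLAP uses 48 kHz

with torch.no_grad():
    # 1. Encode the audio clip into an embedding vector
    audio_inputs = processor(audios=audio, sampling_rate=sr, return_tensors="pt")
    audio_emb = model.get_audio_features(**audio_inputs)

    # 2. Encode the text labels into embedding vectors
    text_inputs = processor(text=labels, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

# 3. Cosine similarity between the audio embedding and each label embedding
audio_emb = audio_emb / audio_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = (audio_emb @ text_emb.T).squeeze(0)

# 4. Highest similarity = predicted class
print(labels[similarity.argmax().item()])
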
4. Model Comparison

From lightweight YAMNet to state-of-the-art BEATs. Choose based on your accuracy, speed, and flexibility requirements.

Model | Type | Architecture | Speed | AudioSet | Notes
YAMNet | CNN | MobileNetV1 on mel spectrogram | Very fast | mAP 0.306 | 521 AudioSet classes, lightweight, production-ready
AST | Transformer | Audio Spectrogram Transformer (ViT-based) | Medium | mAP 0.459 | Patches spectrograms like images, strong AudioSet results
Wav2Vec2 | Transformer | Self-supervised on raw waveform | Slow | Varies | Pre-trained on 60K hours of speech, fine-tune for classification
CLAP | Multimodal | Audio + text encoders (contrastive) | Medium | Zero-shot capable | Zero-shot classification via text prompts, very flexible
Whisper | Transformer | Encoder-decoder for ASR | Slow | Speech-focused | Encoder can be fine-tuned for classification tasks
BEATs | Transformer | Audio pre-training with acoustic tokenizers | Medium | mAP 0.505 | Current SOTA on AudioSet, iterative self-distillation

Use YAMNet when:
  • You need fast, lightweight inference
  • 521 AudioSet classes are sufficient
  • Running on mobile or edge devices

Use AST/BEATs when:
  • Maximum accuracy is critical
  • GPU inference is available
  • Fine-tuning on custom data

Use Wav2Vec2 when:
  • Speech-focused tasks (emotion, speaker ID)
  • Low-resource languages
  • Pre-trained representations matter

Use CLAP when:
  • Zero-shot classification is needed
  • Classes change frequently
  • Natural language class descriptions

Model Sizes (Approximate)

YAMNet: 3.7M params
AST-Base: 86M params
Wav2Vec2-Base: 95M params
CLAP-HTSAT: 142M params

5. Benchmarks and Datasets

Standard datasets for evaluating audio classification. mAP for multi-label, accuracy for single-label.

Dataset | Domain | Classes | Size | Metric | SOTA
AudioSet | General audio | 527 | 2M clips | mAP | 0.505 (BEATs)
ESC-50 | Environmental | 50 | 2,000 clips | Accuracy | 97.0% (BEATs)
Speech Commands | Keywords | 35 | 105K clips | Accuracy | 98.7% (KWT)
FSD50K | Freesound | 200 | 51K clips | mAP | 0.590 (PaSST)
VoxCeleb1 | Speaker ID | 1,251 | 153K clips | EER | 0.87% (ECAPA-TDNN)
GTZAN | Music Genre | 10 | 1,000 clips | Accuracy | 93.9%

Understanding the Metrics

mAP (Mean Average Precision)
Average of per-class AP. Used for multi-label classification, where multiple sounds can occur together.
Range: 0 to 1 (higher is better)

Accuracy
Percentage of correctly classified samples. Used for single-label classification tasks.
Range: 0% to 100%

EER (Equal Error Rate)
The operating point where the false-accept rate equals the false-reject rate. Used for speaker verification tasks.
Range: 0% to 50% (lower is better)
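
mAP is the macro average of per-class average precision; a small scikit-learn sketch on made-up multi-label predictions:

import numpy as np
from sklearn.metrics import average_precision_score

# Made-up ground truth and sigmoid scores for 4 clips and 3 classes
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.7],
                    [0.1, 0.8, 0.3],
                    [0.6, 0.7, 0.2],
                    [0.2, 0.1, 0.9]])

# mAP = mean of per-class average precision (macro average)
mAP = average_precision_score(y_true, y_score, average="macro")
print(f"mAP: {mAP:.3f}")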

AudioSet: The ImageNet of Audio

AudioSet is a massive dataset of 10-second YouTube clips annotated with 527 sound classes. It is the de facto benchmark for general audio classification, similar to ImageNet for vision.

Total clips: 2.1M
Classes: 527 labels
Multi-label: yes (2.7 labels per clip on average)
Clip duration: 10 seconds

Top-level categories: Human sounds, Animal, Music, Environment, Vehicles, Tools

ESC-50: Environmental Sound Classification

A smaller, cleaner benchmark with 50 classes of environmental sounds: 2,000 clips of 5 seconds each. Perfect for quick experiments.

Categories: Animals (dog, cat, crow...), Natural (rain, wind, thunder...), Human (cough, laugh, snore...), Domestic (clock, door, keyboard...), Urban (helicopter, siren, train...)

6. Code Examples

Get started with audio classification in Python. From quick pipelines to zero-shot CLAP.

Transformers Pipeline (recommended)
pip install transformers
from transformers import pipeline

# Load pre-trained audio classification pipeline
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593"
)

# Classify an audio file
result = classifier("path/to/audio.wav")

for pred in result[:5]:
    print(f"{pred['label']:30} {pred['score']:.3f}")

# Output:
# Speech                         0.892
# Music                          0.067
# Silence                        0.023
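
Zero-shot classification with CLAP via the transformers zero-shot audio pipeline, assuming the laion/clap-htsat-unfused checkpoint; the candidate labels can be any text:

from transformers import pipeline

# Zero-shot: no fine-tuning, the classes are just text descriptions
classifier = pipeline(
    "zero-shot-audio-classification",
    model="laion/clap-htsat-unfused"
)

result = classifier(
    "path/to/audio.wav",
    candidate_labels=["dog barking", "rain", "jazz music", "glass breaking"]
)

for pred in result:
    print(f"{pred['label']:20} {pred['score']:.3f}")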

Quick Reference

For Production
  • YAMNet for speed (3.7M params)
  • AST for accuracy (mAP 0.459)
  • 16 kHz sampling, mel spectrograms

For Prototyping
  • CLAP for zero-shot
  • Transformers pipeline
  • ESC-50 for quick eval

Key Takeaways
  • Spectrograms turn audio into images
  • Mel scale matches human perception
  • Transformers now beat CNNs

Use Cases

  • Music genre classification
  • Environmental sound detection
  • Emotion recognition
  • Voice command detection
  • Audio content moderation

Architectural Patterns

Spectrogram + CNN

Convert audio to mel spectrogram, classify with CNN.

Pros:
  + Well-understood
  + Fast inference
  + Works well for many tasks
Cons:
  - Fixed input length
  - Limited long-range temporal context

Audio Transformers

Use self-attention on audio features (AST, HuBERT).

Pros:
  + State-of-the-art accuracy
  + Long-range dependencies
Cons:
  - Larger models
  - More compute

Self-Supervised Pre-training

Pre-train on unlabeled audio, fine-tune for task.

Pros:
  + Less labeled data needed
  + Better generalization
Cons:
  - Large pre-training cost
  - Complex pipeline
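
A sketch of the fine-tuning side of this pattern with transformers, assuming the facebook/wav2vec2-base checkpoint and an illustrative 8-class emotion task; the expensive pre-training is what you download for free:

from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Start from a self-supervised checkpoint (pre-trained on unlabeled speech)
checkpoint = "facebook/wav2vec2-base"
model = AutoModelForAudioClassification.from_pretrained(
    checkpoint,
    num_labels=8,                                    # e.g. 8 emotion classes (illustrative)
    label2id={f"emotion_{i}": i for i in range(8)},
    id2label={i: f"emotion_{i}" for i in range(8)},
)
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)

# Freeze the convolutional feature encoder; fine-tune the transformer and the new head
model.freeze_feature_encoder()

# From here, train on labeled clips (e.g. with the transformers Trainer), feeding
# feature_extractor(waveform, sampling_rate=16000, return_tensors="pt") batches.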

Implementations

Open Source

Audio Spectrogram Transformer (AST)

BSD 3-Clause
Open Source

State-of-the-art on AudioSet. Pure transformer.

Wav2Vec2

Apache 2.0
Open Source

Self-supervised audio. Great for emotion, speaker ID.

CLAP

Apache 2.0
Open Source

Audio-text contrastive learning. Zero-shot classification.

YAMNet

Apache 2.0
Open Source

Google's audio event detection. 521 classes.

Whisper (for speech)

MIT
Open Source

Language detection and voice activity detection come as useful side effects of its speech recognition training.


Quick Facts

Input: Audio
Output: Structured Data
Implementations: 5 open source, 0 API
Patterns: 3 approaches

Submit Results