
Audio Classification

Classify audio into categories like music genres, environmental sounds, speaker emotions, or speech commands.

How Audio Classification Works

A technical deep-dive into Audio Classification. From mel spectrograms and MFCCs to transformers and zero-shot learning with CLAP.

1. What is Audio Classification?

Audio classification assigns labels to audio segments. The model listens to a clip and outputs what sounds are present. This powers everything from voice assistants to content moderation.

The Classification Pipeline

Load Audio (read file, resample to 16 kHz mono) -> Preprocess (extract features, e.g. mel spectrogram) -> Model (neural network, e.g. AST or a CNN) -> Logits (raw predictions, e.g. 527 AudioSet scores) -> Output (final labels, e.g. Dog, Bark)
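
A minimal sketch of these stages with the Hugging Face transformers library, assuming the AudioSet-finetuned AST checkpoint used later on this page and a placeholder file path:

import librosa
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_id = "MIT/ast-finetuned-audioset-10-10-0.4593"  # AST fine-tuned on AudioSet

# 1. Load audio: read the file and resample to 16 kHz mono
waveform, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)

# 2. Preprocess: compute the log-mel features the model expects
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
inputs = feature_extractor(waveform, sampling_rate=sr, return_tensors="pt")

# 3. Model: forward pass through the transformer
model = AutoModelForAudioClassification.from_pretrained(model_id)
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 527): one raw score per AudioSet class

# 4-5. Logits -> labels: AudioSet is multi-label, so apply sigmoid and keep the top scores
scores = torch.sigmoid(logits)[0]
top = torch.topk(scores, k=5)
for score, idx in zip(top.values, top.indices):
    print(f"{model.config.id2label[idx.item()]:30} {score.item():.3f}")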

Common Audio Classification Tasks

  • Music Genre: classify music by genre (Rock, Jazz, Classical)
  • Environmental Sound: identify ambient sounds (Rain, Traffic, Siren)
  • Speech Commands: recognize spoken keywords (Yes, No, Stop, Go)
  • Emotion Detection: detect emotion in speech (Happy, Sad, Angry)
  • Speaker ID: identify who is speaking (Person A, Person B)
  • Language ID: detect the spoken language (English, Spanish, French)
  • Acoustic Events: detect specific sounds (Gunshot, Glass Break)
  • Audio Tagging: multi-label classification (Dog + Bark + Outdoor)

Voice Assistants

Wake word detection ("Hey Siri"), command recognition, and speaker verification for personalization.

Content Moderation

Detect harmful audio content, gunshots in videos, or inappropriate speech in live streams.

Smart Environments

Detect baby crying, glass breaking, smoke alarms, or unusual sounds for home security systems.

Single-Label vs Multi-Label Classification

Single-Label

One label per audio clip. Uses a softmax output and cross-entropy loss.

  "dog.wav" -> Dog Barking
  "jazz.mp3" -> Jazz

Multi-Label

Multiple labels per clip. Uses sigmoid outputs and binary cross-entropy loss.

  "park.wav" -> Dog, Birds, Wind, Speech
  "street.wav" -> Traffic, Horn, Music
2. Audio Preprocessing

Raw audio is a 1D waveform of amplitude values over time. Models need structured features. The key insight: spectrograms turn audio into images, unlocking powerful vision techniques.

The Mel Scale: Why It Matters

Humans perceive pitch logarithmically: the jump from 100 Hz to 200 Hz sounds like the same interval as the jump from 1000 Hz to 2000 Hz. The mel scale compresses frequencies to match this perception.

Frequency to Mel Mapping
Hz        Mel          Note
0 Hz      0 mel        Silence
1000 Hz   1000 mel     Roughly linear below 1 kHz
2000 Hz   ~1520 mel    Logarithmic above 1 kHz
4000 Hz   ~2150 mel    Matches perception
8000 Hz   ~2840 mel    High frequencies compressed
Formula:
mel(f) = 2595 * log10(1 + f/700)
This is why mel spectrograms work so well - they represent audio the way we hear it.
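
A quick sanity check of the table above, evaluating the formula directly:

import math

def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel scale: mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

for f in [0, 1000, 2000, 4000, 8000]:
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
# 0 -> 0.0, 1000 -> ~1000, 2000 -> ~1521, 4000 -> ~2146, 8000 -> ~2840 mel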

Feature Extraction Methods

1. Mel Spectrogram
   Time-frequency representation using the mel scale (matches human hearing)
   Shape: [Time x Mel Bins], e.g. 128 x 128
   + Works great with CNNs, captures timbral info
   - Loses phase information

2. MFCCs
   Mel-Frequency Cepstral Coefficients: compact spectral features
   Shape: [Time x N_MFCC], e.g. 128 x 13
   + Very compact (13-40 coefficients), traditional ASR standard
   - Loses some information, less interpretable

3. Raw Waveform
   Direct audio samples as input to the model
   Shape: [Samples], e.g. 16000 for 1 second at 16 kHz
   + No information loss, model learns its own features
   - Needs more data, longer sequences

4. Log-Mel Filterbank
   Log-scaled mel spectrogram (used by AST, Whisper)
   Shape: [Time x 80/128], e.g. 100 x 80
   + Better dynamic range, standard for transformers
   - Requires careful normalization

How a Spectrogram is Created

Waveform (amplitude vs time) -> STFT (sliding-window FFT) -> Mel filterbank (128 triangular filters) -> Log scale (compress dynamics) -> Log-mel spectrogram

Result: a 2D image where X = time, Y = frequency, and color = amplitude (dB)

Sampling Rate: Most models expect 16kHz (speech) or 22.05kHz (music). AudioSet models use 16kHz. Always resample to match the model's training data. Higher sample rates capture more high-frequency detail but require more compute.
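
A minimal librosa sketch of this chain, assuming a placeholder WAV file and the common 16 kHz / 128-mel configuration:

import librosa
import numpy as np

# Load and resample to 16 kHz mono (match the model's training sample rate)
y, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)

# Waveform -> STFT -> mel filterbank (128 triangular filters)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=160, n_mels=128)

# Log scale to compress the dynamic range (dB)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (128 mel bins, n_frames); transpose for [Time x Mel Bins]

# MFCCs: a further-compressed representation (13 coefficients is the classic choice)
mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)
print(mfcc.shape)     # (13, n_frames)
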
3. Model Architectures

Four main approaches dominate audio classification: CNNs on spectrograms, Transformers on patches, raw waveform models, and contrastive audio-text models.

1. CNN on Spectrogram: treat audio as an image
   Examples: VGGish, YAMNet
   Audio -> Mel Spec -> 2D CNN -> FC -> Classes

2. Transformer on Patches: split the spectrogram into patches
   Examples: AST, HTS-AT
   Audio -> Mel Spec -> Patches -> Transformer -> Classes

3. Raw Waveform: learn from raw samples
   Examples: Wav2Vec2, HuBERT
   Audio -> Conv Encoder -> Transformer -> Classes

4. Contrastive Learning: align audio with text
   Examples: CLAP, AudioCLIP
   Audio -> Encoder | Text -> Encoder -> Similarity

CNN on Spectrogram (VGGish, YAMNet)

The key insight: a spectrogram is just an image. We can use proven image classification architectures directly.

Input: mel spectrogram treated as a 2D image, e.g. (128 mel x 128 time x 1 channel)
Architecture: convolutional layers extract local patterns; pooling reduces dimensionality
Output: global pooling + fully connected layer; softmax or sigmoid over the classes
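
A deliberately tiny PyTorch sketch of the pattern (not VGGish or YAMNet themselves, just spectrogram-as-image classification):

import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Tiny CNN that classifies a (1, 128, 128) log-mel spectrogram 'image'."""
    def __init__(self, num_classes: int = 50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                   # global pooling -> (B, 64, 1, 1)
        )
        self.classifier = nn.Linear(64, num_classes)   # FC layer -> class logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

logits = SpectrogramCNN()(torch.randn(2, 1, 128, 128))  # batch of 2 spectrogram "images"
print(logits.shape)  # torch.Size([2, 50])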

Audio Spectrogram Transformer (AST)

AST treats the spectrogram like ViT treats images: split into patches, add positional embeddings, and feed to a transformer encoder.

Mel Spec (128 x 1024) -> Patches (16x16) -> Linear Projection (768-dim embeddings) -> Transformer (12 layers) -> [CLS] token -> Classes

AST achieves 0.459 mAP on AudioSet, a significant improvement over CNNs. The self-attention mechanism captures long-range temporal dependencies.
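
A simplified patchify step in PyTorch, using non-overlapping patches for clarity (AST itself uses overlapping patches with a smaller stride):

import torch

spec = torch.randn(1, 1, 128, 1024)                  # (batch, channel, mel bins, time frames)

# Cut the spectrogram into non-overlapping 16x16 patches and flatten each patch
patches = spec.unfold(2, 16, 16).unfold(3, 16, 16)   # (1, 1, 8, 64, 16, 16)
patches = patches.reshape(1, -1, 16 * 16)            # (1, 512 patches, 256 values each)

# Each patch becomes a 768-dim token; the token sequence goes into the transformer
proj = torch.nn.Linear(16 * 16, 768)
tokens = proj(patches)
print(tokens.shape)                                  # torch.Size([1, 512, 768])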

CLAP: Zero-Shot Audio Classification

CLAP (Contrastive Language-Audio Pretraining) learns a shared embedding space for audio and text. This enables zero-shot classification using natural language descriptions.

How it works:
  1. Encode the audio clip into an embedding vector
  2. Encode the text labels into embedding vectors
  3. Compute cosine similarity between the audio embedding and each label embedding
  4. Highest similarity = predicted class

Why it matters:
  • No training data needed for new classes
  • Classes can be any text description
  • Great for prototyping and rare events
  • Combine with few-shot learning for best results
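
A sketch of these four steps with the transformers CLAP classes, assuming the laion/clap-htsat-unfused checkpoint (which expects 48 kHz audio); other CLAP variants work the same way:

import librosa
import torch
from transformers import ClapModel, ClapProcessor

model_id = "laion/clap-htsat-unfused"   # assumed checkpoint
model = ClapModel.from_pretrained(model_id)
processor = ClapProcessor.from_pretrained(model_id)

labels = ["a dog barking", "jazz music", "rain falling on a roof", "a baby crying"]
audio, sr = librosa.load("path/to/audio.wav", sr=48000, mono=True)  # CLAP uses 48 kHz

with torch.no_grad():
    # 1. Encode the audio clip into an embedding vector
    audio_inputs = processor(audios=audio, sampling_rate=sr, return_tensors="pt")
    audio_emb = model.get_audio_features(**audio_inputs)

    # 2. Encode the text labels into embedding vectors
    text_inputs = processor(text=labels, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

# 3. Cosine similarity between the audio embedding and each label embedding
audio_emb = audio_emb / audio_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = (audio_emb @ text_emb.T).squeeze(0)

# 4. Highest similarity = predicted class
print(labels[similarity.argmax().item()])
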
4. Model Comparison

From lightweight YAMNet to state-of-the-art BEATs. Choose based on your accuracy, speed, and flexibility requirements.

Model | Type | Architecture | Speed | AudioSet | Notes
YAMNet | CNN | MobileNetV1 on mel spectrogram | Very fast | mAP 0.306 | 521 AudioSet classes, lightweight, production-ready
AST | Transformer | Audio Spectrogram Transformer (ViT-based) | Medium | mAP 0.459 | Patches spectrograms like images, strong AudioSet results
Wav2Vec2 | Transformer | Self-supervised on raw waveform | Slow | Varies | Pre-trained on 60K hours of speech, fine-tune for classification
CLAP | Multimodal | Audio + text encoders (contrastive) | Medium | Zero-shot capable | Zero-shot classification via text prompts, very flexible
Whisper | Transformer | Encoder-decoder for ASR | Slow | Speech-focused | Encoder can be fine-tuned for classification tasks
BEATs | Transformer | Audio pre-training with acoustic tokenizers | Medium | mAP 0.505 | Current SOTA on AudioSet, iterative self-distillation

Use YAMNet when:
  • You need fast, lightweight inference
  • 521 AudioSet classes are sufficient
  • Running on mobile or edge devices

Use AST/BEATs when:
  • Maximum accuracy is critical
  • GPU inference is available
  • Fine-tuning on custom data

Use Wav2Vec2 when:
  • Speech-focused tasks (emotion, speaker ID)
  • Low-resource languages
  • Pre-trained representations matter

Use CLAP when:
  • Zero-shot classification is needed
  • Classes change frequently
  • Natural language class descriptions

Model Sizes (Approximate)

YAMNet: 3.7M params
AST-Base: 86M params
Wav2Vec2-Base: 95M params
CLAP-HTSAT: 142M params

5. Benchmarks and Datasets

Standard datasets for evaluating audio classification. mAP for multi-label, accuracy for single-label.

Dataset | Domain | Classes | Size | Metric | SOTA
AudioSet | General audio | 527 | 2M clips | mAP | 0.505 (BEATs)
ESC-50 | Environmental | 50 | 2,000 clips | Accuracy | 97.0% (BEATs)
Speech Commands | Keywords | 35 | 105K clips | Accuracy | 98.7% (KWT)
FSD50K | Freesound | 200 | 51K clips | mAP | 0.590 (PaSST)
VoxCeleb1 | Speaker ID | 1,251 | 153K clips | EER | 0.87% (ECAPA-TDNN)
GTZAN | Music Genre | 10 | 1,000 clips | Accuracy | 93.9%

Understanding the Metrics

mAP (Mean Average Precision)
Average of per-class AP. Used for multi-label classification, where multiple sounds can occur together.
Range: 0 to 1 (higher is better)

Accuracy
Percentage of correctly classified samples. Used for single-label classification tasks.
Range: 0% to 100%

EER (Equal Error Rate)
The operating point where the false-accept rate equals the false-reject rate. Used for speaker verification tasks.
Range: 0% to 50% (lower is better)
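
mAP is the macro average of per-class average precision; a small scikit-learn sketch on made-up multi-label predictions:

import numpy as np
from sklearn.metrics import average_precision_score

# Made-up ground truth and sigmoid scores for 4 clips and 3 classes
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.7],
                    [0.1, 0.8, 0.3],
                    [0.6, 0.7, 0.2],
                    [0.2, 0.1, 0.9]])

# mAP = mean of per-class average precision (macro average)
mAP = average_precision_score(y_true, y_score, average="macro")
print(f"mAP: {mAP:.3f}")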

AudioSet: The ImageNet of Audio

AudioSet is a massive dataset of 10-second YouTube clips annotated with 527 sound classes. It is the de facto benchmark for general audio classification, similar to ImageNet for vision.

Total clips: 2.1M
Classes: 527 labels
Multi-label: yes (2.7 labels per clip on average)
Clip duration: 10 seconds

Top-level categories: Human sounds, Animal, Music, Environment, Vehicles, Tools

ESC-50: Environmental Sound Classification

A smaller, cleaner benchmark with 50 classes of environmental sounds: 2,000 clips of 5 seconds each. Perfect for quick experiments.

Categories: Animals (dog, cat, crow...), Natural (rain, wind, thunder...), Human (cough, laugh, snore...), Domestic (clock, door, keyboard...), Urban (helicopter, siren, train...)

6. Code Examples

Get started with audio classification in Python. From quick pipelines to zero-shot CLAP.

Transformers Pipeline (recommended)
pip install transformers
from transformers import pipeline

# Load pre-trained audio classification pipeline
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593"
)

# Classify an audio file
result = classifier("path/to/audio.wav")

for pred in result[:5]:
    print(f"{pred['label']:30} {pred['score']:.3f}")

# Output:
# Speech                         0.892
# Music                          0.067
# Silence                        0.023
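
Zero-shot classification with CLAP via the transformers zero-shot audio pipeline, assuming the laion/clap-htsat-unfused checkpoint; the candidate labels can be any text:

from transformers import pipeline

# Zero-shot: no fine-tuning, the classes are just text descriptions
classifier = pipeline(
    "zero-shot-audio-classification",
    model="laion/clap-htsat-unfused"
)

result = classifier(
    "path/to/audio.wav",
    candidate_labels=["dog barking", "rain", "jazz music", "glass breaking"]
)

for pred in result:
    print(f"{pred['label']:20} {pred['score']:.3f}")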

Quick Reference

For Production
  • YAMNet for speed (3.7M params)
  • AST for accuracy (mAP 0.459)
  • 16 kHz sampling, mel spectrograms

For Prototyping
  • CLAP for zero-shot
  • Transformers pipeline
  • ESC-50 for quick eval

Key Takeaways
  • Spectrograms turn audio into images
  • Mel scale matches human perception
  • Transformers now beat CNNs

Use Cases

  • Music genre classification
  • Environmental sound detection
  • Emotion recognition
  • Voice command detection
  • Audio content moderation

Architectural Patterns

Spectrogram + CNN

Convert audio to mel spectrogram, classify with CNN.

Pros:
  + Well-understood
  + Fast inference
  + Works well for many tasks
Cons:
  - Fixed input length
  - Limited long-range temporal context

Audio Transformers

Use self-attention on audio features (AST, HuBERT).

Pros:
  + State-of-the-art accuracy
  + Long-range dependencies
Cons:
  - Larger models
  - More compute

Self-Supervised Pre-training

Pre-train on unlabeled audio, fine-tune for task.

Pros:
  + Less labeled data needed
  + Better generalization
Cons:
  - Large pre-training cost
  - Complex pipeline
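
A sketch of the fine-tuning side of this pattern with transformers, assuming the facebook/wav2vec2-base checkpoint and an illustrative 8-class emotion task; the expensive pre-training is what you download for free:

from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Start from a self-supervised checkpoint (pre-trained on unlabeled speech)
checkpoint = "facebook/wav2vec2-base"
model = AutoModelForAudioClassification.from_pretrained(
    checkpoint,
    num_labels=8,                                    # e.g. 8 emotion classes (illustrative)
    label2id={f"emotion_{i}": i for i in range(8)},
    id2label={i: f"emotion_{i}" for i in range(8)},
)
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)

# Freeze the convolutional feature encoder; fine-tune the transformer and the new head
model.freeze_feature_encoder()

# From here, train on labeled clips (e.g. with the transformers Trainer), feeding
# feature_extractor(waveform, sampling_rate=16000, return_tensors="pt") batches.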

Implementations

Open Source

Audio Spectrogram Transformer (AST)

BSD 3-Clause
Open Source

State-of-the-art on AudioSet. Pure transformer.

Wav2Vec2

Apache 2.0
Open Source

Self-supervised audio. Great for emotion, speaker ID.

CLAP

Apache 2.0
Open Source

Audio-text contrastive learning. Zero-shot classification.

YAMNet

Apache 2.0
Open Source

Google's audio event detection. 521 classes.

Whisper (for speech)

MIT
Open Source

Language detection and voice activity detection come as useful side effects of its speech recognition training.


Quick Facts

Input: Audio
Output: Structured Data
Implementations: 5 open source, 0 API
Patterns: 3 approaches

Submit Results