Audio AI Benchmark

Understanding Audio Intelligence

From classifying environmental sounds to generating music, audio AI has evolved rapidly. Compare models on AudioSet, ESC-50, and explore the cutting edge of sound understanding.

Benchmark Stats

  • Best mAP (AudioSet): 0.498
  • Best Accuracy (ESC-50): 98.1%
  • AudioSet Classes: 632

How Audio AI Works

Most modern audio models don't process raw waveforms directly. Instead, they convert audio into visual representations called spectrograms, then apply computer vision techniques. Here's the pipeline:

Step 1: Raw Audio (Waveform Input)

Audio starts as a 1D waveform signal, typically sampled at 16 kHz or 22.05 kHz. This raw representation captures amplitude over time but does not make frequency content explicit.
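For readers following along in code, here is what Step 1 looks like with torchaudio; the file name is a placeholder:

inspect_waveform.py
import torchaudio

# Load a clip; returns a (channels, samples) tensor and its sample rate
waveform, sr = torchaudio.load("sound.wav")  # hypothetical file

print(waveform.shape, sr)  # e.g. torch.Size([1, 160000]) at 16000 Hz
print(f"Duration: {waveform.shape[-1] / sr:.1f} s")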

Step 2: Mel Spectrogram Conversion

The waveform is converted to a 2D mel spectrogram using an STFT followed by a mel filterbank. This yields an "image" where the x-axis is time, the y-axis is frequency (on the mel scale), and pixel intensity is the energy in each time-frequency bin.
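A minimal sketch of this conversion with torchaudio; the STFT/mel parameters below are typical choices, not values mandated by any particular model:

mel_spectrogram.py
import torchaudio
import torchaudio.transforms as T

waveform, sr = torchaudio.load("sound.wav")  # hypothetical file

# STFT + mel filterbank, then log scaling (decibels)
mel = T.MelSpectrogram(
    sample_rate=sr,
    n_fft=1024,      # window length: 64 ms at 16 kHz
    hop_length=160,  # 10 ms hop -> ~100 frames per second
    n_mels=128,      # number of mel bins (the "image" height)
)
to_db = T.AmplitudeToDB()

spec = to_db(mel(waveform))
print(spec.shape)  # (channels, n_mels, time_frames): the 2D "image" the model sees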

Step 3: Transformer / CNN Processing

The spectrogram is processed by a Vision Transformer (ViT) or a CNN. Patch embeddings capture local time-frequency patterns, while attention mechanisms capture long-range dependencies.

audio_classification.py
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load pretrained AST model fine-tuned on AudioSet
model_id = "MIT/ast-finetuned-audioset-10-10-0.4593"
extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModelForAudioClassification.from_pretrained(model_id)

# Load audio and match the model's expected input: 16 kHz mono
waveform, sr = torchaudio.load("sound.wav")
waveform = waveform.mean(dim=0)  # mix stereo down to mono
if sr != extractor.sampling_rate:
    waveform = torchaudio.functional.resample(waveform, sr, extractor.sampling_rate)

# Extract mel spectrogram features
inputs = extractor(waveform.numpy(), sampling_rate=extractor.sampling_rate, return_tensors="pt")

# Classify
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(-1).item()

print(f"Predicted: {model.config.id2label[predicted_class]}")

Audio Classification

Mean Average Precision (mAP)

AudioSet is a multi-label classification problem. A 10-second clip might contain "Speech", "Dog barking", and "Music" simultaneously. We use mAP to measure how well the model ranks positive labels above negative ones.

1. Per-Class AP: Calculate Average Precision for each of the 527 evaluation classes based on the ranking of predictions.

2. Mean Across Classes: Average the per-class APs to get mAP. Higher is better (0.0 to 1.0 scale). A minimal sketch of the computation follows.
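Here is that two-step computation sketched with scikit-learn, using random arrays as stand-ins for real ground truth and model scores:

compute_map.py
import numpy as np
from sklearn.metrics import average_precision_score

# Stand-in data: 1,000 clips, 527 evaluation classes
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 527))   # binary multi-label ground truth
y_score = rng.random(size=(1000, 527))          # model probabilities

# Step 1: AP per class (skip classes with no positives in this subset)
per_class_ap = [
    average_precision_score(y_true[:, c], y_score[:, c])
    for c in range(y_true.shape[1])
    if y_true[:, c].any()
]

# Step 2: mean across classes
mAP = float(np.mean(per_class_ap))
print(f"mAP: {mAP:.3f}")  # ~0.5 for random scores against balanced random labels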

AudioSet Scale

  • 2M+
    Audio Clips

    10-second segments from YouTube videos

  • 632
    Sound Classes

    Organized in hierarchical ontology

  • 527
    Evaluation Classes

    Filtered for quality and balance

  • ~20K
    Eval Clips

    Held-out 10-second segments for benchmark comparison

AudioSet Leaderboard

Mean Average Precision on AudioSet evaluation set. Higher is better.

Rank  Model                                 Org                   mAP    Architecture                                   Type         Year
#1    BEATs                                 Microsoft             0.498  Audio Tokenizer + Transformer                  Open Source  2023
#2    Audio Spectrogram Transformer (AST)   MIT/IBM               0.485  Vision Transformer                             Open Source  2021
#3    HTS-AT                                ByteDance             0.471  Hierarchical Token-Semantic Audio Transformer  Open Source  2022
#4    CLAP                                  LAION/Microsoft       0.463  Contrastive Learning                           Open Source  2023
#5    PANNs (CNN14)                         University of Surrey  0.431  CNN                                            Open Source  2020
#6    Wav2Vec 2.0                           Meta                  0.392  Self-supervised Transformer                    Open Source  2020

ESC-50 Leaderboard

Accuracy on Environmental Sound Classification (50 classes, 5-fold cross-validation). Higher is better.

Rank  Model                 Org                   Accuracy (%)  Type         Year
#1    BEATs                 Microsoft             98.1          Open Source  2023
#2    CLAP                  LAION/Microsoft       96.7          Open Source  2023
#3    AST                   MIT/IBM               95.6          Open Source  2021
#4    PANNs                 University of Surrey  94.7          Open Source  2020
#5    wav2vec 2.0 + Linear  Meta                  92.3          Open Source  2020

Music Generation

The New Era of AI Music

2024 marked a breakthrough in music generation. Models like Suno and Udio can now generate full songs with vocals and lyrics at a production quality approaching that of professional studios.

Unlike TTS, which synthesizes speech from text, music generation creates complex multi-track compositions, handling melody, harmony, rhythm, lyrics, and vocal performance simultaneously.

Key Capabilities

  • Text-to-Music: Describe a song, get audio
  • Lyrics + Melody: Generate vocals with coherent lyrics
  • Style Transfer: Convert between genres
  • Continuation: Extend existing audio clips

Evaluation Challenge

Unlike classification where we have ground truth labels, music generation quality is subjective. Current evaluation methods include:

  • FAD
Fréchet Audio Distance

    Statistical distance between generated and real music embeddings

  • MOS
    Mean Opinion Score

    Human ratings on quality, coherence, and musicality (1-5 scale)

  • KLD
    KL Divergence

    Distribution similarity for genre/instrument classification
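Of these, FAD is the most mechanical to compute: fit a Gaussian to each set of embeddings and take the Fréchet distance between them. A minimal sketch, assuming you already have embeddings (e.g., from a VGGish encoder) as NumPy arrays:

frechet_audio_distance.py
import numpy as np
from scipy import linalg

def frechet_audio_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (N, D) embedding sets."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(real_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)

    # Matrix square root of the covariance product
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))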

Music Generation Models

Comparison of text-to-music and audio generation models. Quality assessed via community consensus and published evaluations.

Model             Org           Quality    Key Features                               Type         Year
Suno v3.5         Suno          Excellent  Full songs with vocals, lyrics generation  Cloud API    2024
Udio              Udio          Excellent  High-quality vocals, genre diversity       Cloud API    2024
MusicGen          Meta          Good       Text-to-music, melody conditioning         Open Source  2023
Stable Audio 2.0  Stability AI  Good       Long-form generation, audio-to-audio       Open Source  2024
AudioCraft        Meta          Good       MusicGen + AudioGen combined               Open Source  2023
Riffusion         Community     Fair       Spectrogram diffusion                      Open Source  2023

Audio Captioning & Understanding

Audio Captioning

Generate natural language descriptions of audio content. The task goes beyond classification to provide detailed, contextual descriptions: "A dog barks twice, followed by a car horn in the distance."

Key datasets: AudioCaps, Clotho, WavCaps

Audio-Language Models

The latest frontier: multimodal LLMs that can understand and reason about audio. These models combine audio encoders with large language models for open-ended audio understanding.

Examples: Qwen2-Audio, SALMONN, LTU, Pengi
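As an illustration, here is roughly how Qwen2-Audio is queried through Hugging Face transformers. This follows the pattern published on the model card, but treat it as a hedged sketch: exact processor arguments vary across transformers versions, and the audio file is a placeholder.

audio_qa.py
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Build a chat turn that interleaves an audio clip with a text question
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "sound.wav"},   # hypothetical file
        {"type": "text", "text": "Describe this sound in one sentence."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

audio, _ = librosa.load("sound.wav", sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)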

Audio Understanding Models

Models for audio captioning, audio question answering, and general audio understanding.

Model        Org                 Performance  Key Features                             Type         Year
Qwen2-Audio  Alibaba             SOTA         Multimodal LLM with audio understanding  Open Source  2024
SALMONN      Tsinghua/ByteDance  Excellent    Speech + Audio LLM                       Open Source  2024
Whisper-AT   MIT                 Good         Audio tagging with Whisper encoder       Open Source  2023
CLAP + GPT   Various             Good         Embeddings + LLM generation              Hybrid       2023
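The embedding approach in the last row is easy to sketch: CLAP maps text and audio into a shared space, so retrieval reduces to cosine similarity. A minimal example using the transformers CLAP integration; the checkpoint is one published LAION-CLAP model, and the audio file is a placeholder:

clap_retrieval.py
import torch
import torchaudio
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# CLAP's audio encoder expects 48 kHz mono input
waveform, sr = torchaudio.load("sound.wav")  # hypothetical file
waveform = torchaudio.functional.resample(waveform.mean(0), sr, 48_000)

captions = ["a dog barking", "rain falling on a roof", "an electric guitar solo"]
text_inputs = processor(text=captions, return_tensors="pt", padding=True)
audio_inputs = processor(audios=waveform.numpy(), sampling_rate=48_000, return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
    audio_emb = model.get_audio_features(**audio_inputs)

# Cosine similarity ranks captions for the clip (retrieval / zero-shot tagging)
sims = torch.nn.functional.cosine_similarity(audio_emb, text_emb)
print(captions[sims.argmax()])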

Why Transformers Dominate Audio

Audio signals have long-range dependencies. A musical phrase might span several seconds; a spoken sentence requires understanding context from start to finish.

Transformers with self-attention naturally capture these dependencies. The Audio Spectrogram Transformer (AST) treats spectrograms as images and applies Vision Transformer architecture, achieving SOTA by leveraging pretrained ImageNet weights and fine-tuning on audio.
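To make the "spectrogram as image" idea concrete, here is a sketch of AST-style patch embedding. The 16x16 patch size with stride 10 (overlapping patches) follows the AST paper; the embedding width and input size are illustrative:

patch_embedding.py
import torch
import torch.nn as nn

# AST splits the spectrogram into overlapping 16x16 patches (stride 10)
patch_embed = nn.Conv2d(in_channels=1, out_channels=768, kernel_size=16, stride=10)

spec = torch.randn(1, 1, 128, 1024)  # (batch, channel, mel bins, time frames)
patches = patch_embed(spec)          # (1, 768, 12, 101)

# Flatten patches into a token sequence for the Transformer
tokens = patches.flatten(2).transpose(1, 2)
print(tokens.shape)  # (1, 1212, 768): 1,212 tokens attending to each other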

AudioSet Challenges

Despite its size, AudioSet has known issues that affect benchmarking:

  • Label Noise: Human annotations are imperfect; ~30% of labels may have some error
  • Class Imbalance: "Speech" appears in millions of clips; rare sounds have only hundreds
  • Missing Videos: ~20% of original YouTube videos are now unavailable
  • Multi-label Complexity: Average of 2.7 labels per clip makes evaluation nuanced

Key Datasets

AudioSet (2017)

2M+ human-labeled 10-second YouTube video clips covering 632 audio event classes.

Samples: 2,000,000+
Metric: mAP

ESC-50 (2015)

2,000 environmental audio recordings organized into 50 classes (animals, natural soundscapes, etc.).

Samples: 2,000
Metric: Accuracy

Summary: Which Model Should You Use?

Audio Classification

Best Overall: BEATs (Microsoft) - Highest mAP, open source, efficient tokenizer approach.
Best Practical: AST - Well-documented, HuggingFace support, easy to fine-tune.

Music Generation

Best Quality: Suno v3.5 / Udio - Professional-quality vocals and production.
Best Open Source: MusicGen (Meta) - Runs locally, good for instrumentals.

Audio Understanding

Best Multimodal: Qwen2-Audio - Full audio LLM with reasoning capabilities.
For Embeddings: CLAP - Audio-text alignment for search and retrieval.

Contribute to Audio AI

Have you achieved better results on AudioSet or ESC-50? Working on novel audio generation models? Help the community by sharing your verified results.