Audio

Processing general audio signals? Test your models on sound classification, event detection, music analysis, and source separation.

7 tasks · 10 datasets · 9 results

Audio AI in 2025 has shifted from task-specific models to unified foundation approaches. Whisper dominates ASR with 680K hours of training data. Suno and Udio democratized music generation, with 100K+ songs created. Google's MSEB benchmark exposed substantial gaps in current audio understanding.

State of the Field (2025)

  • Music Generation: Suno v4.5 and Udio enable full-length 4-minute songs from text, with 100K+ user-generated tracks analyzed. Stable Audio excels at instrumental loops and soundbeds. AI music market projected to hit $38.7B by 2033 (25.8% CAGR).
  • Speech Recognition: Whisper (680K hours, multilingual) achieves 50% fewer errors than specialized models on diverse datasets. mHuBERT-147 (95M params, 90K hours) ranks first on ML-SUPERB while outperforming 1B parameter models. FunASR's SenseVoice (234M params) handles 5 languages with emotion recognition.
  • Audio Classification: FAST achieves 0.448 mAP on AudioSet with 150x fewer parameters than competing transformers. Cochleagram representations yield a 5.16% improvement on sound event detection versus spectrograms. Audio Spectrogram Transformer (AST) hits 98.12% accuracy on Speech Commands v2.
  • Benchmarks: MSEB (NeurIPS 2025), a unified evaluation across 8 audio capabilities (voice search, reasoning, retrieval, classification), reveals substantial performance gaps. Semantic bottlenecks from ASR stages universally constrain language-content tasks. Cross-modal grounding remains a critical weakness.

Quick Recommendations

Music generation (commercial release)

Suno v4.5 for full songs, Stable Audio for instrumentals

Suno generates full 4-minute songs with vocals, with improved transitions in v4.5. Stable Audio offers the cleanest IP clarity and best instrumental quality for background tracks. Both allow commercial use with proper licensing.

Speech recognition (multilingual, robust)

Whisper (base/turbo) or mHuBERT-147

Whisper excels on accents and noise (50% fewer errors on diverse datasets). mHuBERT-147 delivers 95M-parameter efficiency while outperforming 1B-parameter models, making it ideal for mobile deployment.

Audio classification (edge deployment)

FAST architecture

Competitive AudioSet performance (0.448 mAP) with 150x fewer parameters than AST. Combines CNNs with transformers for efficient feature extraction. Runs on resource-constrained devices.

Text-to-audio generation (research, custom domains)

AudioLDM

Single-GPU trainability with zero-shot manipulation capabilities. Open-source enables fine-tuning on custom datasets for domain-specific generation (game sound effects, meditation soundscapes).

Multimodal audio understanding

Qwen2-Audio

Strong performance across audio understanding benchmarks with audio-text conversation capabilities. Integrates with Qwen language model ecosystem. Fine-tune for domain-specific tasks like music information retrieval.

Speaker separation (broadcast quality)

AudioShake

State-of-the-art high-fidelity multi-speaker separation for hours-long recordings. Essential for post-production, podcast transcription with diarization, and voice AI requiring clean separated tracks.

Production ASR toolkit (comprehensive features)

FunASR with SenseVoice

234M params, 300K hours training. Handles ASR, voice activity detection, punctuation, speaker diarization, emotion recognition across 5 languages (Mandarin, Cantonese, English, Japanese, Korean). Open-source.

Audio content authentication

WaveVerify watermarking

Robust against perturbations and attacks (NeurIPS 2024). Critical for financial services and healthcare where voice fraud poses $10M+ risks. Use AudioMarkBench to evaluate robustness requirements.
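
WaveVerify's actual scheme is not reproduced here, but the core idea behind correlation-based audio watermarking can be sketched: embed a key-seeded pseudo-random pattern at low amplitude, then detect it by correlating against the same key's pattern. Function names, the strength value, and the silence stand-in signal below are all illustrative, not WaveVerify's API.

```python
import random

def embed_watermark(samples, key, strength=0.01):
    """Add a key-seeded +/-1 pseudo-random pattern at low amplitude."""
    rng = random.Random(key)
    pattern = [rng.choice((-1.0, 1.0)) for _ in samples]
    return [s + strength * p for s, p in zip(samples, pattern)]

def detect_watermark(samples, key):
    """Correlate against the key's pattern; a high score means present."""
    rng = random.Random(key)
    pattern = [rng.choice((-1.0, 1.0)) for _ in samples]
    return sum(s * p for s, p in zip(samples, pattern)) / len(samples)

# Toy signal: silence stand-in for real audio.
audio = [0.0] * 4096
marked = embed_watermark(audio, key=42)
print(detect_watermark(marked, key=42))  # close to the embed strength (0.01)
print(detect_watermark(marked, key=7))   # near zero: wrong key, no detection
```

Production schemes add perceptual shaping and robustness to compression and resampling, which is exactly what AudioMarkBench stress-tests.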

Comprehensive audio evaluation

MSEB benchmark framework

Evaluates 8 core audio capabilities (semantic and acoustic tasks) across curated datasets. Reveals performance gaps before deployment. Test beyond domain-specific benchmarks for production robustness.
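
MSEB's real API is not shown here, but the pattern it encourages, running one model across several capability-specific suites and inspecting the per-task spread rather than a single average, can be sketched with a hypothetical harness (the model stub and suite names are placeholders):

```python
def evaluate_across_capabilities(model_fn, suites):
    """Run model_fn on each (task, examples) suite; report per-task accuracy."""
    report = {}
    for task, examples in suites.items():
        correct = sum(model_fn(task, x) == y for x, y in examples)
        report[task] = correct / len(examples)
    return report

# Hypothetical stand-in model and two tiny capability suites.
def dummy_model(task, x):
    return x.upper() if task == "classification" else x

suites = {
    "classification": [("dog", "DOG"), ("car", "CAR")],
    "retrieval": [("query1", "query1"), ("query2", "doc2")],
}
report = evaluate_across_capabilities(dummy_model, suites)
print(report)  # per-task scores expose capability gaps a mean would hide
```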

Low-latency voice agents

Custom VAD + streaming synthesis + model routing

Engineering optimization matters more than raw model quality. Implement concurrent reasoning, adaptive model selection (route simple tasks to efficient models), and streaming TTS to minimize Time to First Audio.
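
The routing idea can be sketched as follows: classify each utterance's difficulty with a cheap heuristic, dispatch to a fast or a high-quality model accordingly, and measure the delay before the first response is ready. The heuristic, model stubs, and names below are placeholders, not a specific framework's API.

```python
import time

def looks_simple(utterance):
    """Cheap heuristic: short, question-free utterances take the fast path."""
    return len(utterance.split()) < 8 and "?" not in utterance

def fast_model(utterance):
    return f"ack: {utterance}"

def quality_model(utterance):
    return f"considered reply to: {utterance}"

def respond(utterance):
    """Route by estimated difficulty; track a Time-to-First-Audio proxy."""
    start = time.perf_counter()
    model = fast_model if looks_simple(utterance) else quality_model
    text = model(utterance)
    ttfa = time.perf_counter() - start
    return text, ttfa

reply, latency = respond("turn on the lights")
print(reply)  # short command routed to the fast model
```

In a real agent the same routing decision also governs which TTS voice streams first, so the heuristic must be far cheaper than either model it selects between.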

Tasks & Benchmarks

Music Generation

Generating music from text, audio, or other inputs.

1 dataset · 3 results · SOTA tracked

Audio Captioning

Generating text descriptions of audio content.

1 dataset · 3 results · SOTA tracked

Sound Event Detection

Detecting and localizing sound events in audio.

1 dataset · 3 results · SOTA tracked

Text-to-Audio

Text-to-audio generates sound effects, music, and ambient audio from natural language descriptions — a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, enabling production-ready sound design from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics poorly capture subjective quality. The unsolved challenges are temporal coherence in long-form generation (>30 seconds), precise control over timing and structure, and music that maintains harmonic consistency across full songs.
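
FAD compares generated and reference audio by fitting Gaussians to their embedding distributions. In the one-dimensional case the Fréchet distance reduces to (μ₁ − μ₂)² + σ₁² + σ₂² − 2σ₁σ₂; a minimal sketch of that reduced form (real FAD uses VGGish embeddings and full covariance matrices, so this is illustrative only):

```python
import math

def frechet_distance_1d(xs, ys):
    """Frechet distance between 1-D Gaussians fit to two samples."""
    def stats(v):
        mu = sum(v) / len(v)
        var = sum((x - mu) ** 2 for x in v) / len(v)
        return mu, math.sqrt(var)
    mu1, s1 = stats(xs)
    mu2, s2 = stats(ys)
    return (mu1 - mu2) ** 2 + s1 ** 2 + s2 ** 2 - 2 * s1 * s2

# Identical samples score 0; a shift costs the squared mean gap.
print(frechet_distance_1d([0, 1, 2], [0, 1, 2]))  # 0.0
print(frechet_distance_1d([0, 1, 2], [3, 4, 5]))  # 9.0
```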

1 dataset · 0 results

Voice Activity Detection

Voice activity detection (VAD) answers the deceptively simple question "is someone speaking right now?" — and getting it wrong ruins everything downstream in speech pipelines. Silero VAD became the open-source standard by shipping a model under 2MB that runs in real-time on CPU with >95% accuracy, while pyannote.audio's segmentation model pushed the state of the art for overlapping speech detection. Production VAD must handle extreme conditions: background music, crowd noise, whispered speech, and non-speech vocalizations (coughs, laughs) that fool simpler models. Modern systems increasingly combine VAD with speaker diarization ("who spoke when") in unified models, and the rise of real-time conversational AI has made sub-100ms latency VAD a critical infrastructure component.
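
Silero's neural approach is out of scope for a snippet, but the classical baseline it replaced, short-time energy thresholding, shows the frame-level decision structure every VAD shares. The frame length and threshold below are arbitrary illustrative values:

```python
def energy_vad(samples, frame_len=160, threshold=0.01):
    """Mark each frame speech/non-speech by mean squared amplitude."""
    decisions = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        decisions.append(energy > threshold)
    return decisions

# One quiet frame followed by one loud frame.
silence = [0.001] * 160
loud = [0.5 if i % 2 else -0.5 for i in range(160)]
print(energy_vad(silence + loud))  # [False, True]
```

Energy thresholding is exactly what fails on background music and whispered speech, which is why neural VADs took over despite this baseline being essentially free.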

2 datasets · 0 results

Audio Classification

Audio classification identifies what's happening in a sound — music genre, environmental sounds, speaker emotion, language identification — and underpins everything from content moderation to smart home devices. Audio Spectrogram Transformer (AST) and BEATs brought ImageNet-level transfer learning to audio by treating spectrograms as images, achieving state-of-the-art mAP on AudioSet's 527-class ontology. The paradigm shifted with audio foundation models like CLAP (contrastive language-audio pretraining) and Whisper's encoder, which provide general-purpose audio representations that transfer to downstream tasks with minimal fine-tuning. The hard problems remain fine-grained classification in noisy real-world conditions, rare sound event detection with few examples, and efficient on-device inference for always-listening applications.

2 datasets · 0 results

Audio-to-Audio

Audio-to-audio encompasses speech enhancement, voice conversion, source separation, and style transfer — any task where audio goes in and transformed audio comes out. Speech enhancement (denoising) was revolutionized by Meta's Demucs and Microsoft's DCCRN, now used in every video call; voice conversion took a leap with RVC and So-VITS-SVC enabling zero-shot voice cloning that sparked both creative tools and deepfake concerns. Source separation (isolating vocals, drums, bass from a mix) reached near-production quality with HTDemucs and Band-Split RNN, making stems extraction a solved problem for most music. The field is converging toward unified models that handle multiple audio transformations through natural language instructions, blurring the line with text-to-audio generation.
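
Real separators like HTDemucs learn their transform from data, but the underlying idea, inverting a mixing process, can be shown on the toy instantaneous case: two sources mixed by a known 2x2 matrix are recovered exactly by its inverse. This is a conceptual sketch, not how any production separator works.

```python
def mix(s1, s2, a, b, c, d):
    """Mix two sources with matrix [[a, b], [c, d]]."""
    x1 = [a * u + b * v for u, v in zip(s1, s2)]
    x2 = [c * u + d * v for u, v in zip(s1, s2)]
    return x1, x2

def unmix(x1, x2, a, b, c, d):
    """Recover the sources by applying the inverse mixing matrix."""
    det = a * d - b * c
    s1 = [(d * u - b * v) / det for u, v in zip(x1, x2)]
    s2 = [(-c * u + a * v) / det for u, v in zip(x1, x2)]
    return s1, s2

vocals = [1.0, -1.0, 0.5]
drums = [0.2, 0.4, -0.6]
x1, x2 = mix(vocals, drums, 0.8, 0.6, 0.3, 0.9)
rec_vocals, rec_drums = unmix(x1, x2, 0.8, 0.6, 0.3, 0.9)
print(rec_vocals)  # matches the original vocals up to float rounding
```

Real music is a convolutive, time-varying mixture with unknown parameters, which is why the problem needed deep networks rather than a matrix inverse.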

2 datasets · 0 results

Music Generation

MusicLM: 4 (FAD)

Audio Captioning

AudioCaps baseline (TopDown+Align): 0.37 (SPIDEr)

Sound Event Detection

DESED (2020)
ATST-SED: 58.1 (event-F1)

Text-to-Audio

AudioCaps (T2A), 2023

Voice Activity Detection

Audio Classification

AudioSet (2017)
ESC-50 (2015)

Honest Takes

Music gen platforms have IP landmines

Suno and Udio enable commercial use, but the legal status of their training data remains murky. Stable Audio offers the cleanest IP clarity for commercial work. If you're producing background music or loops, Stable Audio's instrumental focus simplifies legal compliance versus vocal generation.

MSEB exposed how far we are from universal audio intelligence

Google's comprehensive benchmark revealed current models fall substantially short on all 8 core audio tasks. ASR stages universally bottleneck semantic understanding. Models trained on clean audio collapse under real-world noise and reverberation. We're nowhere near human-level audio understanding.

Lightweight models are production-ready

FAST achieves competitive AudioSet performance (0.448 mAP) with 150x fewer parameters. mHuBERT-147 beats 1B-parameter models while fitting on mobile devices. Stop defaulting to cloud-only systems: edge deployment is now viable for most audio tasks.

Cultural bias is embarrassing

CMI-Bench shows 80%+ performance on Western pop, but models collapse on non-Western genres (bossa nova, Celtic, Medieval). Training data concentrated on Western music creates systems that are useless for cross-cultural audio understanding. Test on your target demographics.

Audio watermarking is critical infrastructure now

Deepfake fraud attempts rose 1,300% from 2023 to 2024, with $10M+ losses to voice scams. WaveVerify enables robust watermarking against attacks. If you're deploying TTS or voice synthesis, watermarking isn't optional anymore; it's liability protection.

Self-supervised learning killed labeled data requirements

wav2vec 2.0 achieves 4.8/8.2 WER (LibriSpeech test-clean/test-other) using only 10 minutes of labeled data plus 53K hours unlabeled. Stop spending on expensive manual annotation: self-supervised pretraining delivers superior representations at a fraction of the cost.
