Speech

Working with voice and audio? Evaluate speech-to-text accuracy, voice synthesis quality, and speaker identification performance.

5 tasks · 9 datasets · 40 results

Speech tech in 2025 is defined by massive foundation models trained on 500K+ hours of audio. Whisper, trained on 680K hours, dominates open-source ASR. Diffusion-based synthesis has transformed TTS, and real-time systems now deliver sub-200ms latency. State-of-the-art systems are multilingual, accent-robust, and fast enough for live use.

State of the Field (2025)

  • ASR: Whisper-Large (1.5B params, 680K hours) achieves 1.9-3.9% WER on clean speech. AssemblyAI Conformer-1 (650K hours) cuts noisy speech errors 43%. Gemini leads on accented speech via LLM integration.
  • TTS: Higgs Audio V2 (3B params, 10M hours) tops expressiveness. Deepgram Aura delivers sub-200ms latency. XTTS enables voice cloning from 6-second samples. NeuTTS Air runs on-device with 0.5B params.
  • Speaker Verification: w2v-BERT 2.0 (600M params, 450M hours across 143 languages) achieves 0.12% EER on VoxCeleb1-O. SVeritas benchmark reveals cross-language and age-mismatch vulnerabilities.
  • Architectures: Conformer dominates ASR with progressive downsampling and grouped attention (29% faster inference). Diffusion models power TTS. Self-supervised pre-training (wav2vec, WavLM) enables low-resource deployment.

Quick Recommendations

Production ASR (batch, high accuracy)

Whisper-Large or AssemblyAI Conformer-1

1.9-3.9% WER on clean speech. Whisper is open-source with broad support. Conformer-1 offers enterprise reliability and business-domain optimization.

Real-time ASR (streaming, low latency)

AWS Transcribe or AssemblyAI Streaming

Best latency-accuracy tradeoff. Whisper's 6-7% WER penalty on streaming makes it unusable for conversational AI. Managed APIs handle scaling.

Accented/technical speech ASR

Google Gemini (multimodal)

LLM integration crushes traditional ASR on accents and domain-specific terminology. World knowledge compensates for acoustic ambiguity.

Multilingual/code-switched ASR

SeamlessM4T-v2-Large

43.6% improvement on code-switched speech. Handles 143 languages. Purpose-built for mixed-language scenarios, unlike Whisper's general-purpose multilingual training.

High-quality TTS (audiobooks, media)

Higgs Audio V2

3B params, 10M hours training. Best expressiveness and emotional modulation. Top-trending on Hugging Face for a reason.

Low-latency TTS (chatbots, IVR)

Deepgram Aura

Sub-200ms latency enables natural conversational flow. Includes speech fillers and emotional modulation. Purpose-built for real-time.

Voice cloning (minimal reference data)

XTTS-v2

6-second samples for full voice replication. Widely adopted, extensive integrations, robust across diverse speakers. Zero-shot works.

On-device TTS (mobile, IoT, privacy)

NeuTTS Air

0.5B params runs on Raspberry Pi. Near-human quality without cloud dependency. Kills latency and privacy concerns.

Speaker verification (security-critical)

w2v-BERT 2.0 based systems

0.12% EER on VoxCeleb. 450M hours training across 143 languages. Evaluate on SVeritas benchmark for real-world robustness.
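EER, the metric quoted above, is the operating point where the false-accept rate equals the false-reject rate. A minimal sketch of computing it from raw trial scores (toy data for illustration; toolkits such as SpeechBrain ship calibrated implementations):

```python
def eer(genuine: list[float], impostor: list[float]) -> float:
    """Equal error rate: scan candidate thresholds and return the point
    where false-accept rate (FAR) and false-reject rate (FRR) cross."""
    best_gap, best_eer = float("inf"), 1.0
    for t in sorted(genuine + impostor):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuines rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

# Toy similarity scores: higher = more likely same speaker
print(eer([0.9, 0.8, 0.4], [0.6, 0.2, 0.1]))  # ~0.33: one error each way
```

A 0.12% EER means roughly one trial in 800 is misclassified at the balanced operating point; security-critical deployments usually tune the threshold toward lower FAR instead.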

Accent-robust ASR (non-native speakers)

Whisper + MAS-LoRA fine-tuning

Mixture of accent-specific LoRA experts improves unknown accents vs full fine-tuning. Parameter-efficient, reduces catastrophic forgetting.

Cost-optimized ASR (high volume)

Self-hosted Whisper on containers

Open-source eliminates per-request API costs. Accept infrastructure management responsibility for 10-100x cost savings at scale.
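The breakeven arithmetic is simple. A sketch with illustrative numbers only (API and GPU prices are assumptions, not quotes from any vendor):

```python
def breakeven_hours(api_price_per_audio_hour: float,
                    monthly_infra_cost: float) -> float:
    """Audio hours per month above which self-hosting is cheaper than a
    per-request API, ignoring engineering time and bandwidth."""
    return monthly_infra_cost / api_price_per_audio_hour

# Illustrative: $0.40 per audio-hour API vs a ~$600/month GPU node
print(breakeven_hours(0.40, 600.0))  # 1500.0 hours/month
```

Below the breakeven volume, the managed API is usually cheaper once operations time is priced in.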

Multi-speaker dialogue TTS

Dia (1B-2B variants)

Dialogue-focused with laughter, sighing, nonverbal elements. Streaming architecture. 2min continuous English per output.

Tasks & Benchmarks

Speech Recognition

Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model with OpenAI's Whisper (2022), which was trained on 680K hours of web audio and became the de facto open-source standard almost overnight. Whisper large-v3 hits under 5% word error rate (WER) on LibriSpeech clean, and commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech, where error rates are 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). AssemblyAI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.
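WER, the metric used throughout this page, is word-level edit distance divided by reference length. A minimal reference implementation (production tools like jiwer add text normalization on top, which matters a lot in practice):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = min edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on the mat"))  # ~0.167
```

Note WER can exceed 1.0 when the hypothesis inserts heavily, which is why noisy-speech numbers look so much worse than clean-speech ones.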

4 datasets · 20 results · SOTA tracked

Text-to-Speech

Text-to-speech has undergone a stunning transformation from robotic concatenation to near-human expressiveness in under five years. ElevenLabs, OpenAI's TTS, and XTTS-v2 produce speech that most listeners cannot distinguish from recordings, while open models like Bark, VALL-E (Microsoft), and F5-TTS demonstrated that voice cloning from 3-second samples is now a commodity capability. The frontier has moved beyond intelligibility (solved) to prosody, emotion control, and real-time streaming at under 200ms latency for conversational AI. Evaluation remains messy — MOS (Mean Opinion Score) is subjective and expensive, and automated metrics like UTMOS only loosely correlate with human preference, making benchmark comparisons unreliable.
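One concrete reason MOS comparisons are unreliable: with typical listener-panel sizes, the confidence interval around a score is wide. A quick stdlib-only sketch using a normal approximation (toy ratings, not real study data):

```python
from math import sqrt
from statistics import mean, stdev

def mos_with_ci(ratings: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean opinion score with an approximate 95% confidence interval."""
    m = mean(ratings)
    half = z * stdev(ratings) / sqrt(len(ratings))
    return m, m - half, m + half

# 8 listeners rating one sample on the usual 1-5 scale (toy data)
m, lo, hi = mos_with_ci([4, 5, 4, 3, 5, 4, 4, 5])
print(f"MOS {m:.2f} (95% CI {lo:.2f}-{hi:.2f})")  # interval spans ~0.98 points
```

With an interval nearly a full MOS point wide, two systems "0.2 apart" on a small panel are statistically indistinguishable.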

2 datasets · 11 results · SOTA tracked

Speaker Verification

Verifying speaker identity from voice samples.

1 dataset · 3 results · SOTA tracked

Speech Translation

Translating spoken audio directly to another language.

1 dataset · 3 results · SOTA tracked

Voice Cloning

Replicating a speaker's voice characteristics.

1 dataset · 3 results · SOTA tracked

Speech Recognition

  • Common Voice (2019): 11.2 WER — Whisper Large-v2
  • FLEURS (2022)
  • LibriSpeech (2015): 5.2 WER (test-other) — Whisper Large-v2
  • WildASR (2025)

Text-to-Speech

  • LJ Speech (2017): 4.61 MOS — VALL-E 2
  • VCTK (2019): 4.36 MOS — NaturalSpeech 3

Speaker Verification

  • VoxCeleb1-O (2017): 1.18 EER — ResNet-34 (AM-Softmax, VoxCeleb2)

Speech Translation

  • MuST-C En-De tst-COMMON (2019): 37.1 BLEU — SeamlessM4T v2 Large

Voice Cloning

  • LibriTTS test-clean, zero-shot TTS (2019): 5.9 WER — VALL-E

Honest Takes

Whisper is overhyped for production

Whisper excels on benchmarks but struggles with streaming. 6-7% WER increase vs batch processing makes real-time painful. For conversational AI or live captioning, AWS Transcribe or AssemblyAI streaming APIs deliver better latency-accuracy tradeoffs despite Whisper's fame.

Accent robustness remains embarrassing

Google's legacy ASR hits 35% WER on non-native speech while Gemini achieves 10-15%. After billions in R&D, the field still can't reliably transcribe half the world's English speakers. If your users aren't native speakers, expect to double WER.

TTS latency wars are won

Deepgram Aura's sub-200ms latency kills the 'robotic delay' problem for conversational AI. Combined with streaming synthesis (Dia, MELA-TTS), we finally have TTS that feels human-speed. The bottleneck shifted from synthesis to LLM response time.
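The shifted bottleneck is easy to see in a per-turn latency budget. The numbers below are illustrative assumptions, not vendor measurements:

```python
# Illustrative per-turn latency budget for a voice agent (assumed numbers):
# with sub-200ms TTS, the LLM is the largest contributor to response delay.
budget_ms = {
    "asr_endpoint_finalize": 300,  # streaming ASR emits the final transcript
    "llm_first_token": 600,        # time to first LLM token
    "tts_first_audio": 200,        # time to first synthesized audio chunk
}
total = sum(budget_ms.values())
bottleneck = max(budget_ms, key=budget_ms.get)
print(total, "ms to first audible response")   # 1100 ms
print("largest contributor:", bottleneck)      # llm_first_token
```

Shaving synthesis from 200ms to 100ms barely moves the total; streaming the LLM output sentence-by-sentence into the TTS does.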

Zero-shot voice cloning is production-ready

XTTS cloning voices from 6-second samples isn't a research demo anymore. It's deployed at scale. The ethical nightmare is here, but so is massive UX improvement for multilingual content, accessibility, and personalized experiences.

Foundation models killed task-specific speech systems

Why train separate ASR, speaker verification, and emotion recognition models? w2v-BERT 2.0 (450M hours, 143 languages) handles all tasks. SeamlessM4T does ASR, translation, and TTS in one model. Specialist systems are legacy tech.
