Speech
Working with voice and audio? Evaluate speech-to-text accuracy, voice synthesis quality, and speaker identification performance.
Speech tech in 2025 is defined by massive foundation models trained on 500K+ hours of audio. Whisper dominates ASR with 680K hours of training data. Diffusion models revolutionized TTS, and synthesis latency has dropped below 200ms. Production systems are now multilingual, accent-robust, and real-time.
State of the Field (2025)
- ASR: Whisper-Large (1.5B params, 680K hours) achieves 1.9-3.9% WER on clean speech. AssemblyAI Conformer-1 (650K hours) cuts noisy-speech errors by 43%. Gemini leads on accented speech via LLM integration.
- TTS: Higgs Audio V2 (3B params, 10M hours) tops expressiveness. Deepgram Aura delivers sub-200ms latency. XTTS enables voice cloning from 6-second samples. NeuTTS Air runs on-device with 0.5B params.
- Speaker Verification: w2v-BERT 2.0 (600M params, 450M hours across 143 languages) achieves 0.12% EER on VoxCeleb1-O. SVeritas benchmark reveals cross-language and age-mismatch vulnerabilities.
- Architectures: Conformer dominates ASR with progressive downsampling and grouped attention (29% faster inference). Diffusion models power TTS. Self-supervised pre-training (wav2vec, WavLM) enables low-resource deployment.
Quick Recommendations
Production ASR (batch, high accuracy)
Whisper-Large or AssemblyAI Conformer-1
1.9-3.9% WER on clean speech. Whisper is open-source with broad support. Conformer-1 offers enterprise reliability and business-domain optimization.
Real-time ASR (streaming, low latency)
AWS Transcribe or AssemblyAI Streaming
Best latency-accuracy tradeoff. Whisper's 6-7% WER penalty on streaming makes it unusable for conversational AI. Managed APIs handle scaling.
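The streaming pattern these managed APIs expose is roughly: send fixed-size audio chunks as they arrive and receive rolling partial transcripts back. A minimal, vendor-neutral sketch with a stand-in recognizer (the real client objects, chunk sizes, and callbacks differ per provider):

```python
from collections.abc import Callable, Iterable, Iterator

CHUNK_MS = 200  # assumed chunk size; real streaming APIs typically take 100-500 ms frames

def stream_transcribe(
    chunks: Iterable[bytes],
    recognize: Callable[[bytes], str],
) -> Iterator[str]:
    """Feed audio chunks to a recognizer, yielding a partial transcript per chunk.

    `recognize` is a stand-in for a vendor streaming client (AWS Transcribe,
    AssemblyAI, ...); here it maps the audio buffered so far to text.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        yield recognize(buffer)

# Stub recognizer: just report how much audio has been "heard" so far.
partials = list(stream_transcribe([b"\x00" * 4, b"\x00" * 4],
                                  lambda b: f"{len(b)} bytes"))
print(partials)  # ['4 bytes', '8 bytes']
```

The key design point is that partials are revisable: each yield may rewrite earlier words, which is exactly what batch-oriented models like Whisper were not built to do.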
Accented/technical speech ASR
Google Gemini (multimodal)
LLM integration crushes traditional ASR on accents and domain-specific terminology. World knowledge compensates for acoustic ambiguity.
Multilingual/code-switched ASR
SeamlessM4T-v2-Large
43.6% improvement on code-switched speech. Handles 143 languages. Purpose-built for mixed-language scenarios, unlike Whisper's general-purpose multilingual support.
High-quality TTS (audiobooks, media)
Higgs Audio V2
3B params, 10M hours training. Best expressiveness and emotional modulation. Top-trending on Hugging Face for a reason.
Low-latency TTS (chatbots, IVR)
Deepgram Aura
Sub-200ms latency enables natural conversational flow. Includes speech fillers and emotional modulation. Purpose-built for real-time.
Voice cloning (minimal reference data)
XTTS-v2
6-second samples for full voice replication. Widely adopted, extensive integrations, robust across diverse speakers. Zero-shot works.
On-device TTS (mobile, IoT, privacy)
NeuTTS Air
0.5B params runs on Raspberry Pi. Near-human quality without cloud dependency. Kills latency and privacy concerns.
Speaker verification (security-critical)
w2v-BERT 2.0 based systems
0.12% EER on VoxCeleb. 450M hours training across 143 languages. Evaluate on SVeritas benchmark for real-world robustness.
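EER, the metric quoted above, is the operating point where the false-accept rate equals the false-reject rate. A minimal sketch over raw similarity scores (assuming higher score means more likely the same speaker):

```python
def eer(genuine: list[float], impostor: list[float]) -> float:
    """Equal error rate: sweep thresholds, return the rate where FAR meets FRR."""
    best_gap, rate = float("inf"), 1.0
    for t in sorted(set(genuine + impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        if abs(far - frr) < best_gap:
            best_gap, rate = abs(far - frr), (far + frr) / 2
    return rate

print(eer([0.9, 0.8], [0.2, 0.1]))            # 0.0 -- perfectly separated scores
print(eer([0.8, 0.6, 0.4], [0.7, 0.3, 0.2]))  # overlapping scores -> nonzero EER
```

A 0.12% EER means roughly 1 error in 800 trials at the balanced threshold; benchmarks like SVeritas matter because that number degrades sharply under cross-language or age-mismatch conditions.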
Accent-robust ASR (non-native speakers)
Whisper + MAS-LoRA fine-tuning
Mixture of accent-specific LoRA experts improves unknown accents vs full fine-tuning. Parameter-efficient, reduces catastrophic forgetting.
Cost-optimized ASR (high volume)
Self-hosted Whisper on containers
Open-source eliminates per-request API costs. Accept infrastructure management responsibility for 10-100x cost savings at scale.
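A back-of-envelope cost model makes the tradeoff concrete. All rates below are hypothetical placeholders, not quotes from any provider, and the model deliberately ignores the engineering and ops overhead the recommendation warns about; under these particular assumptions the gap comes out around 9x:

```python
# All rates are hypothetical placeholders -- plug in your own quotes.
API_RATE_PER_MIN = 0.006   # assumed managed-API price, $ per audio minute
GPU_RATE_PER_HOUR = 0.80   # assumed cloud GPU price, $ per hour
REAL_TIME_FACTOR = 0.05    # assumed throughput: 1 hour of audio in 3 GPU-minutes

def api_cost(audio_minutes: float) -> float:
    return audio_minutes * API_RATE_PER_MIN

def self_hosted_cost(audio_minutes: float) -> float:
    gpu_hours = (audio_minutes / 60) * REAL_TIME_FACTOR
    return gpu_hours * GPU_RATE_PER_HOUR

minutes = 1_000_000  # roughly 695 days of audio per month
print(f"API:         ${api_cost(minutes):,.0f}")          # $6,000
print(f"Self-hosted: ${self_hosted_cost(minutes):,.0f}")  # $667
```

The ratio is dominated by the real-time factor: faster inference (batching, quantization) directly multiplies the savings, which is how the 10-100x figure becomes plausible at scale.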
Multi-speaker dialogue TTS
Dia (1B-2B variants)
Dialogue-focused with laughter, sighs, and other nonverbal elements. Streaming architecture. Generates up to 2 minutes of continuous English per output.
Tasks & Benchmarks
Speech Recognition
Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model with OpenAI's Whisper (2022), trained on 680K hours of web audio, which became the de facto open-source standard almost overnight. Whisper large-v3 hits under 5% word error rate on LibriSpeech clean, and commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech, where error rates are 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). AssemblyAI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.
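Word error rate, the metric quoted throughout, is just word-level edit distance divided by reference length. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance with single-row dynamic programming.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        diag, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            diag, row[j] = row[j], min(
                row[j] + 1,       # delete a reference word
                row[j - 1] + 1,   # insert a hypothesis word
                diag + (r != h),  # substitute (free if words match)
            )
    return row[-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on the mat"))  # 1 sub / 6 words ~ 0.167
```

Note that WER can exceed 100% (insertions are unbounded), and naive whitespace splitting makes it sensitive to text normalization, which is one reason cross-vendor benchmark numbers are hard to compare directly.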
Text-to-Speech
Text-to-speech has undergone a stunning transformation from robotic concatenation to near-human expressiveness in under five years. ElevenLabs, OpenAI's TTS, and XTTS-v2 produce speech that most listeners cannot distinguish from recordings, while open models like Bark, VALL-E (Microsoft), and F5-TTS demonstrated that voice cloning from 3-second samples is now a commodity capability. The frontier has moved beyond intelligibility (solved) to prosody, emotion control, and real-time streaming at under 200ms latency for conversational AI. Evaluation remains messy — MOS (Mean Opinion Score) is subjective and expensive, and automated metrics like UTMOS only loosely correlate with human preference, making benchmark comparisons unreliable.
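Because MOS is a mean of noisy human ratings, any comparison between systems needs an error bar; small MOS gaps between models are often within it. A minimal sketch using a normal approximation (the ratings below are made-up illustration data):

```python
import statistics

def mos_with_ci(ratings: list[float]) -> tuple[float, float]:
    """Mean Opinion Score and the half-width of a 95% normal-approximation CI."""
    mean = statistics.fmean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5
    return mean, half_width

# 20 hypothetical listener ratings on the usual 1-5 scale.
ratings = [4, 5, 4, 3, 4, 4, 5, 4, 3, 5, 4, 4, 3, 5, 4, 4, 5, 3, 4, 4]
mean, hw = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {hw:.2f}")  # MOS = 4.05 +/- 0.30
```

With only 20 raters the interval is about +/-0.3 MOS, so two systems "0.1 MOS apart" on a small listening test are statistically indistinguishable, which is the unreliability the paragraph above describes.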
Speaker Verification
Verifying speaker identity from voice samples.
Speech Translation
Translating spoken audio directly to another language.
Voice Cloning
Replicating a speaker's voice characteristics.
Honest Takes
Whisper is overhyped for production
Whisper excels on benchmarks but struggles with streaming. 6-7% WER increase vs batch processing makes real-time painful. For conversational AI or live captioning, AWS Transcribe or AssemblyAI streaming APIs deliver better latency-accuracy tradeoffs despite Whisper's fame.
Accent robustness remains embarrassing
Google's legacy ASR hits 35% WER on non-native speech while Gemini achieves 10-15%. After billions in R&D, the field still can't reliably transcribe half the world's English speakers. If your users aren't native speakers, expect to double WER.
TTS latency wars are won
Deepgram Aura's sub-200ms latency kills the 'robotic delay' problem for conversational AI. Combined with streaming synthesis (Dia, MELA-TTS), we finally have TTS that feels human-speed. The bottleneck shifted from synthesis to LLM response time.
Zero-shot voice cloning is production-ready
XTTS cloning voices from 6-second samples isn't a research demo anymore. It's deployed at scale. The ethical nightmare is here, but so is massive UX improvement for multilingual content, accessibility, and personalized experiences.
Foundation models killed task-specific speech systems
Why train separate ASR, speaker verification, and emotion recognition models? w2v-BERT 2.0 (450M hours, 143 languages) handles all tasks. SeamlessM4T does ASR, translation, and TTS in one model. Specialist systems are legacy tech.