Speech Recognition Benchmarks
How automatic speech recognition evaluation evolved from clean read speech on LibriSpeech, through multi-speaker and noisy conditions, toward naturalistic and multilingual benchmarks that reflect real deployment environments. The spine tracks where word error rate evaluation moved as clean-speech performance saturated; branches cover speaker verification (VoxCeleb), noisy conditions (LibriSpeech-other, GigaSpeech), and multilingual evaluation (FLEURS, Common Voice).
LibriSpeech test-clean has been effectively solved: modern end-to-end systems achieve 1.5–2% WER, near the transcription noise floor. The field's response has been to test harder conditions: multi-speaker conversational recordings (CHiME-6), accented and multilingual crowdsourced speech (Common Voice), and genuinely unconstrained real-world audio (WildASR). FLEURS brought multilingual coverage to 102 languages and is now the standard for evaluating speech foundation models like Whisper. The active frontier as of 2025 is naturalistic multi-speaker diarization + transcription, a task where no current system is close to human parity on challenging domains.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Click any node to jump to its detail.
Nodes in detail.
LibriSpeech
1,000 hours of read English audiobook speech, with test sets split into clean and other (harder) partitions; other groups the speakers a baseline recognizer found most difficult. Defined ASR evaluation for the deep-learning era. test-clean WER under 2% for strong systems; test-other under 4%. Both effectively saturated for top models.
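Under the hood, WER is a normalized word-level edit distance: the substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch of the metric itself, not any benchmark's official scorer (real scoring pipelines typically normalize casing and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words (standard Levenshtein DP).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 sub / 6 words ≈ 0.167
```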
VoxCeleb
100K+ utterances from 1,251 celebrities scraped from YouTube. VoxCeleb2 expanded to 6,112 identities. The standard speaker verification benchmark; equal error rate (EER) is the metric. Active as a speaker-modelling benchmark even as ASR has moved on.
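EER is read off the detection trade-off curve: sweep the decision threshold over trial scores and find the point where the false acceptance rate (impostor trials accepted) equals the false rejection rate (genuine trials rejected). A NumPy-only sketch on synthetic scores; the Gaussian trial distributions here are hypothetical, chosen only to make the crossover visible:

```python
import numpy as np

def eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal error rate. labels: 1 = same-speaker trial, 0 = impostor trial."""
    # Sweep every observed score as a candidate decision threshold.
    thresholds = np.sort(np.unique(scores))
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    # EER is where the two curves cross; take the closest sampled point.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

# Toy trial scores: higher = more likely the same speaker.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)   # same-speaker trial scores
impostor = rng.normal(0.0, 1.0, 1000)  # different-speaker trial scores
scores = np.concatenate([genuine, impostor])
labels = np.concatenate([np.ones(1000), np.zeros(1000)]).astype(int)
print(f"EER: {eer(scores, labels):.1%}")  # ~16% for this 2-sigma separation
```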
CHiME-6
20 dinner-party sessions recorded with distant microphones; multi-speaker, naturally overlapping speech with realistic noise. WER for systems without oracle diarization exceeds 50% for most participants. Exposed the gap between clean-speech WER progress and real conversational ASR.
Common Voice
Crowdsourced multilingual speech covering 100+ languages, many low-resource. Accent diversity within English makes it a harder distribution shift test than LibriSpeech. Primary use: multilingual and low-resource ASR evaluation, not English-only benchmarking.
GigaSpeech
10,000 hours of transcribed English from audiobooks, podcasts, and YouTube. Larger and more diverse than LibriSpeech; tests model robustness to domain and acoustic variation across sources.
FLEURS
102-language speech benchmark built on the FLoRes-101 machine translation corpus, with its parallel sentences read aloud by native speakers. Covers many low-resource languages not represented in LibriSpeech or Common Voice. The standard benchmark for evaluating multilingual speech foundation models like Whisper.
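A FLEURS evaluation loop is short in practice. The sketch below assumes the openai-whisper, datasets (with audio support), and jiwer packages and the Hugging Face google/fleurs release; the yo_ng (Yoruba) config is an illustrative choice, and field names should be checked against the dataset version actually downloaded:

```python
# pip install openai-whisper datasets jiwer
import jiwer
import whisper
from datasets import load_dataset

model = whisper.load_model("small")
# Slice the split ("test[:32]") for a quick smoke test; the full loop
# is slow on CPU. FLEURS audio is already at Whisper's 16 kHz rate.
fleurs = load_dataset("google/fleurs", "yo_ng", split="test")

references, hypotheses = [], []
for example in fleurs:
    audio = example["audio"]["array"].astype("float32")
    result = model.transcribe(audio, language="yo")
    references.append(example["transcription"])
    hypotheses.append(result["text"])

# Corpus-level WER over the whole split.
print(f"WER: {jiwer.wer(references, hypotheses):.1%}")
```

Reported Whisper results apply text normalization to both references and hypotheses before scoring; skipping it, as this sketch does, inflates WER from casing and punctuation mismatches alone.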
WildASR
Naturalistic audio from diverse real-world environments — phone calls, live events, spontaneous conversation. Designed to expose failure modes that clean-speech benchmarks mask. The emerging standard for assessing deployment readiness of ASR systems.