STT leaderboard
Speech-to-text SOTA by mean WER: Granite Speech, Cohere Transcribe, Whisper, Parakeet, APIs and open ASR.
Choose the direction first: speech-to-text for transcripts, text-to-speech for generated voices, independent evals for vendor selection, or DSP views when you need to inspect what a voice is doing.
This hub tracks 36 STT models, 21 TTS registry rows, and 5 CodeSOTA TTS evaluation tracks. STT rows rank by shared benchmark metrics; TTS is split into measured runs, reported MOS metadata, and pending evaluation tracks.
The modern ranking metric is mean WER across the 8 HF Open ASR Leaderboard datasets — see the dedicated STT leaderboard. The catalogue below is sorted by each row's reported WER (benchmark in the column); LibriSpeech test-clean is now saturated near 1–2% and shown as the historical frontier.
| # | Model | Vendor | Kind | Params | Benchmark | WER |
|---|---|---|---|---|---|---|
| 01 | Cohere Transcribe (Mar 2026) | Cohere | Open Source | 2B | LibriSpeech test-clean | 1.3 |
| 02 | Higgs Audio v3 8B STT v2 | Boson AI | Open Source | 8B | LibriSpeech test-clean | 1.3 |
| 03 | Granite Speech 4.1 2B (NAR) | IBM | Open Source | 2B | LibriSpeech test-clean | 1.3 |
| 04 | Granite Speech 4.1 2B | IBM | Open Source | 2B | LibriSpeech test-clean | 1.3 |
| 05 | Parakeet TDT 1.1B | NVIDIA | Open Source | 1.1B | LibriSpeech test-clean | 1.4 |
| 06 | Granite 4.0 1B Speech | IBM | Open Source | 1B | LibriSpeech test-clean | 1.4 |
| 07 | Granite Speech 3.3 8B | IBM | Open Source | 8B | LibriSpeech test-clean | 1.4 |
| 08 | Parakeet RNNT 1.1B | NVIDIA | Open Source | 1.1B | LibriSpeech test-clean | 1.4 |
| 09 | Canary 1B | NVIDIA | Open Source | 1B | LibriSpeech test-clean | 1.5 |
| 10 | Canary 1B Flash | NVIDIA | Open Source | 1B | LibriSpeech test-clean | 1.5 |
| 11 | AssemblyAI Universal-3 Pro | AssemblyAI | Cloud API | — | LibriSpeech test-clean | 1.5 |
| 12 | Granite Speech 3.3 2B | IBM | Open Source | 2B | LibriSpeech test-clean | 1.5 |
| 13 | ElevenLabs Scribe v2 | ElevenLabs | Cloud API | — | LibriSpeech test-clean | 1.5 |
| 14 | Voxtral Small 24B | Mistral AI | Open Source | 24B | LibriSpeech test-clean | 1.6 |
| 15 | Canary-Qwen-2.5B | NVIDIA | Open Source | 2.5B | LibriSpeech test-clean | 1.6 |
| 16 | Zoom Scribe v1 | Zoom | Cloud API | — | LibriSpeech test-clean | 1.6 |
| 17 | Qwen3-ASR-1.7B | Alibaba | Open Source | 1.7B | LibriSpeech test-clean | 1.6 |
| 18 | Phi-4 Multimodal Instruct | Microsoft | Open Source | 6B | LibriSpeech test-clean | 1.7 |
| 19 | Parakeet TDT 0.6B v2 | NVIDIA | Open Source | 0.6B | LibriSpeech test-clean | 1.7 |
| 20 | Pulse Pro | Smallest AI | Cloud API | — | LibriSpeech test-clean | 1.8 |
| 21 | Voxtral Mini 3B | Mistral AI | Open Source | 3B | LibriSpeech test-clean | 1.9 |
| 22 | Conformer XL | Google | Research | 600M | LibriSpeech test-clean | 2.0 |
| 23 | Whisper Large v3 | OpenAI | Open Source | 1.55B | LibriSpeech test-clean | 2.0 |
| 24 | Google Chirp 3 | Google | Cloud API | — | LibriSpeech test-clean | 2.0 |
| 25 | Whisper Large v3 Turbo | OpenAI | Open Source | 809M | LibriSpeech test-clean | 2.1 |
| 26 | Deepgram Nova-3 | Deepgram | Cloud API | — | LibriSpeech test-clean | 2.2 |
| 27 | Voxtral Large | Mistral AI | Cloud API | — | LibriSpeech test-clean | 2.3 |
| 28 | Gladia v2 | Gladia | Cloud API | — | LibriSpeech test-clean | 2.5 |
| 29 | Speechmatics Flow | Speechmatics | Cloud API | — | LibriSpeech test-clean | 2.6 |
| 30 | Groq Whisper | Groq | Cloud API | 1.55B | LibriSpeech test-clean | 2.7 |
| 31 | Google USM | Google | Cloud API | 2B | LibriSpeech test-clean | 2.8 |
| 32 | Gemini 3 Pro (audio) | Google | Cloud API | — | AA-WER v2.0 | 2.9 |
| 33 | Azure Speech | Microsoft | Cloud API | — | LibriSpeech test-clean | 3.0 |
| 34 | Moonshine Base | Useful Sensors | Open Source | 61M | LibriSpeech test-clean | 3.5 |
| 35 | wav2vec 2.0 | Meta | Open Source | 317M | LibriSpeech test-clean | 3.8 |
| 36 | Gradium Speech-to-Text | Gradium | Cloud API | — | AA-WER v2.0 | 8.5 |
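The WER figures in the catalogue above can be read against a minimal sketch of the metric. It assumes whitespace tokenisation, no text normalisation, and an unweighted mean across datasets; real leaderboards normalise casing and punctuation first, and the function names here are illustrative rather than CodeSOTA's own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def mean_wer(per_dataset_wer: dict[str, float]) -> float:
    """Unweighted mean of per-dataset WER values (an assumed aggregation)."""
    if not per_dataset_wer:
        raise ValueError("need at least one dataset")
    return sum(per_dataset_wer.values()) / len(per_dataset_wer)
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions, which is one reason rows scored on different benchmarks are not directly comparable.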
Measured CodeSOTA runs rank on the dedicated TTS leaderboard. Reported MOS remains useful metadata, but it is not a universal SOTA ranking. Preference and controllability tracks stay unranked until CodeSOTA has its own artifacts.
These tracks define what CodeSOTA needs before it calls a TTS model SOTA. Only the information-fidelity track has measured rows today. Naturalness preference, realtime behavior, controllability, and long-form stability stay unranked until CodeSOTA runs the same prompts and publishes artifacts.
| Track | Status | Metric | Scope | Evidence |
|---|---|---|---|---|
| Naturalness preference | live | blind pairwise win rate / Elo | same prompts, neutral A/B labels, matched voices where possible | |
| Realtime agents | planned | p50/p95 TTFA, interruption behavior, streaming chunk quality | voice-agent prompts with short turns and barge-in cases | |
| Information fidelity | live | WER, CER, critical entity accuracy, severe errors | hard text: numbers, URLs, dates, names, acronyms, product codes | |
| Controllability | planned | style/emotion/tag adherence plus acoustic movement | emotion, pace, pauses, whisper, emphasis, multi-speaker instructions | |
| Long-form stability | planned | voice drift, omissions, repetitions, chapter-level artifact rate | 5-15 minute narration and dialogue prompts | |
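The "blind pairwise win rate / Elo" metric in the naturalness track can be illustrated with a standard Elo update. This is a sketch only: the K-factor of 32, the starting rating of 1000, and the system names are assumptions, not CodeSOTA's published configuration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one blind A/B preference result in place."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical systems, both starting from the same rating.
ratings = {"voice_a": 1000.0, "voice_b": 1000.0}
update_elo(ratings, winner="voice_a", loser="voice_b")
```

Elo depends on the order in which comparisons arrive, so a published ranking would normally also state how comparisons were sampled and how many were collected.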
| Model | Vendor | Kind | Verification | Params | MOS | MOS note |
|---|---|---|---|---|---|---|
| ElevenLabs Turbo v2.5 | ElevenLabs | Cloud API | vendor reported | — | 4.8 | within MOS noise |
| Sesame CSM | Sesame | Open Source | community reported | 1B+ | 4.7 | within MOS noise |
| OpenAI TTS HD | OpenAI | Cloud API | vendor reported | — | 4.7 | within MOS noise |
| Gemini 2.5 Pro TTS | Google | Cloud API | vendor reported | — | 4.7 | within MOS noise |
| Cartesia Sonic 2 | Cartesia | Cloud API | vendor reported | — | 4.7 | within MOS noise |
| ElevenLabs Flash v2.5 | ElevenLabs | Cloud API | vendor reported | — | 4.6 | reported MOS; no CodeSOTA CI |
| PlayHT 3.0 | PlayHT | Cloud API | vendor reported | — | 4.6 | reported MOS; no CodeSOTA CI |
| Fish Audio S2 Pro | Fish Audio | Open Source | paper reported | 5B | 4.6 | reported MOS; no CodeSOTA CI |
| Orpheus TTS | Canopy Labs | Open Source | community reported | 3B | 4.6 | reported MOS; no CodeSOTA CI |
| Gemini 2.5 Flash TTS | Google | Cloud API | vendor reported | — | 4.5 | reported MOS; no CodeSOTA CI |
| Kokoro v1.0 | Hexgrad | Open Source | codesota measured | 82M | 4.5 | no CI yet; measured run exposes sample count and artifacts |
| XTTS v2 | Coqui | Open Source | paper reported | 467M | 4.5 | reported MOS; no CodeSOTA CI |
| Google Chirp 3 HD | Google | Cloud API | vendor reported | — | 4.4 | reported MOS; no CodeSOTA CI |
| Gradium TTS | Gradium | Cloud API | codesota measured | — | 4.4 | no CI yet; measured run exposes sample count and artifacts |
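The "within MOS noise" and "no CI" notes in the table above refer to the uncertainty around a mean opinion score. One way to make that concrete, assuming raw listener ratings are available and a normal approximation is acceptable, is sketched below. The source does not say how CodeSOTA would compute its intervals, so the method and names are assumptions.

```python
import math
import statistics


def mos_with_ci(ratings: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, mean - half_width, mean + half_width


def intervals_overlap(a: tuple[float, float, float], b: tuple[float, float, float]) -> bool:
    """True when two (mean, lower, upper) intervals share any range."""
    return a[1] <= b[2] and b[1] <= a[2]
```

Two systems whose intervals overlap cannot be separated by MOS alone, which is consistent with treating reported MOS as metadata rather than a ranking.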
Gradium is tracked as a hosted TTS API with 4.4 MOS in the catalog and a separate intelligibility run: 13.4% normalized WER, 73.3% critical-entity accuracy, and 299 ms p95 first-byte latency on 30 hard English prompts.
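The critical-entity accuracy and p95 latency figures quoted above can be sketched as follows. The scoring rule here (case-insensitive substring match of each expected entity in the ASR transcript of the synthesised audio) and the nearest-rank percentile are assumptions for illustration; the source does not specify either.

```python
import math


def critical_entity_accuracy(entities_per_prompt: list[list[str]], transcripts: list[str]) -> float:
    """Fraction of expected entities found verbatim in the matching transcript."""
    total = 0
    preserved = 0
    for entities, transcript in zip(entities_per_prompt, transcripts):
        text = transcript.lower()
        for entity in entities:
            total += 1
            if entity.lower() in text:
                preserved += 1
    return preserved / total if total else 0.0


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

With only 30 prompts, a nearest-rank p95 is determined by the second-slowest request, so small runs are sensitive to a single outlier.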
Long-form reads for the common decisions: which commercial TTS, which open-source, which model fits podcasts, audiobooks, voice bots or cloning.
Speech-to-text SOTA by mean WER: Granite Speech, Cohere Transcribe, Whisper, Parakeet, APIs and open ASR.
CodeSOTA-measured text-to-speech runs with artifacts, configs, transcripts, and hashes.
Hosted TTS APIs and open-source voice models in one procurement directory.
Gradium and Kokoro on hard English prompts: WER, entity preservation, latency and cost.
Flagship commercial TTS head-to-head: quality, cost, latency, voice library.
Quality leader against the purpose-built low-latency challenger.
Hyperscaler comparison: pricing, voices, SSML, streaming.
Long-form naturalness ranked: pacing, breath, intonation over 30+ minutes.
SSML, character voices, consistency across chapters.
TTFB under 200ms: Cartesia, ElevenLabs Flash, Gemini Flash.
Zero-shot similarity, data requirements, and consent-ethics framing.
Kokoro, Sesame CSM, Orpheus, F5-TTS, Dia: licensed and deployable.
Eleven open-source TTS voices, the same prompt, rendered through five DSP lenses and Griffin-Lim resynthesis. A reproducible walkthrough of the representations that vocoders, ASR systems and human ears actually read: mel spectrograms, MFCC, F0, formants.
Every figure is generated from the same code path; every voice is labelled with its provenance. No fabricated spectrograms, no stock audio. If the sample cannot be reproduced, it doesn't appear.
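Two of the representations named above, the mel scale and F0, can be sketched in a few lines. This is not the walkthrough's actual code path: the HTK-style mel formula and the plain autocorrelation pitch estimator are common conventions chosen here as assumptions, and the synthetic 200 Hz tone stands in for real audio.

```python
import math


def hz_to_mel(hz: float) -> float:
    """HTK-style mel conversion (one common convention among several)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def estimate_f0(samples: list[float], sample_rate: int,
                f_min: float = 60.0, f_max: float = 400.0) -> float:
    """Pick the lag with the highest raw autocorrelation inside the pitch range."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Synthetic test tone: 0.2 s of a 200 Hz sine at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(1600)]
f0 = estimate_f0(tone, sample_rate)
```

Raw autocorrelation is prone to octave errors on real speech, so production pitch trackers usually add normalisation and voicing decisions on top of this idea.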
The canonical benchmark for each direction, plus the community-adopted follow-ups. LibriSpeech, Common Voice and VCTK are canonicalised in our dataset registry; FLEURS, AudioBench and EARS are tracked qualitatively pending canonicalisation.
Rows with a mark live in the registry and carry full lineage.
| Benchmark | Scope | Primary metric | Year | Source | |
|---|---|---|---|---|---|
| LibriSpeech | Speech-to-Text | wer-test-clean | 2015 | link → | |
| Common Voice | Speech-to-Text | wer | 2019 | link → | |
| LJ Speech | Text-to-Speech | mos | 2017 | link → | |
| VCTK | Text-to-Speech | mos | 2019 | link → | |
| TTS Intelligibility | Text-to-Speech | critical-entity-accuracy | 2026 | link → | |
| FLEURS | Speech-to-Text | WER (per-lang) | 2022 | link → | |
| AudioBench | Audio-LLM | composite | 2024 | link → | |
| EARS | Text-to-Speech | MOS · subjective | 2024 | link → | |
Modern speech recognition converts raw audio into mel-spectrogram features, runs them through a Conformer or Transformer encoder, and decodes with CTC, RNNT or attention. Post-processing (language-model rescoring, punctuation, diarisation) yields the final transcript.
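Of the decoding options mentioned above, CTC is the simplest to show. The sketch below implements greedy CTC decoding, assuming the encoder has already produced one argmax token id per frame; the vocabulary and the choice of `0` as the blank id are hypothetical.

```python
def ctc_greedy_decode(frame_token_ids: list[int], id_to_char: dict[int, str],
                      blank_id: int = 0) -> str:
    """Collapse consecutive repeats, then drop blanks."""
    decoded = []
    previous = None
    for token_id in frame_token_ids:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank token.
        if token_id != previous and token_id != blank_id:
            decoded.append(id_to_char[token_id])
        previous = token_id
    return "".join(decoded)


# Hypothetical character vocabulary; id 0 is reserved for the CTC blank.
vocab = {1: "c", 2: "a", 3: "t", 4: "o"}
```

The blank token is what allows genuine double letters: in `[3, 0, 4, 4, 0, 4]` the blank between the two runs of `4` keeps both occurrences of "o".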
Modern speech synthesis runs the pipeline in reverse. Text is embedded by a language model; acoustic tokens are predicted autoregressively or by flow matching; a vocoder or neural codec decodes those tokens back to waveform. The neural audio codec — EnCodec, SoundStream, Mimi — is the hinge that lets TTS borrow the tooling of LLMs.
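To make "acoustic tokens" concrete: codecs in this family are commonly described as using residual vector quantisation, where each stage quantises what the previous stage left over. The toy below works on a single scalar with hand-written codebooks, which is an assumption made for readability; real codecs quantise learned latent vectors with learned codebooks.

```python
def rvq_encode(value: float, codebooks: list[list[float]]) -> list[int]:
    """Return one code index per stage; each stage quantises the remaining residual."""
    codes = []
    residual = value
    for codebook in codebooks:
        index = min(range(len(codebook)), key=lambda i: abs(residual - codebook[i]))
        codes.append(index)
        residual -= codebook[index]
    return codes


def rvq_decode(codes: list[int], codebooks: list[list[float]]) -> float:
    """Sum the selected entry from each stage to reconstruct the value."""
    return sum(codebook[index] for index, codebook in zip(codes, codebooks))


# Hypothetical two-stage codebooks: coarse first, fine second.
codebooks = [[-1.0, 0.0, 1.0], [-0.25, 0.0, 0.25]]
codes = rvq_encode(0.8, codebooks)
reconstructed = rvq_decode(codes, codebooks)
```

The resulting integer codes are the kind of discrete sequence a language-model-style TTS system can predict token by token.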
What changed recently is the representation. Once audio could be tokenised, every architectural trick from text generation became available to speech: pretraining, instruction-tuning, prompted style control, zero-shot cloning. That is why the open-source gap in TTS closed so quickly after 2023.
On the STT side, the Conformer block — self-attention plus convolution — is still the workhorse. Whisper took a different path with a pure Transformer encoder-decoder trained on weak supervision at scale, trading some efficiency for massive multilingual coverage.
Other modality hubs on CodeSOTA worth reading next.