Codesota · Speech · Vol. II
The register of speech-to-text and text-to-speech
Issue: May 19, 2026
§ 00 · Speech

Speech router

Choose the direction first: speech-to-text for transcripts, text-to-speech for generated voices, independent evals for vendor selection, or DSP views when you need to inspect what a voice is doing.

Tracked: 36 STT models, 21 TTS registry rows, and 5 CodeSOTA TTS evaluation tracks. STT rows rank by shared benchmark metrics; TTS is split into measured runs, reported MOS metadata, and pending evaluation tracks.

Transcribe: audio → words (compare by WER · streaming latency)
Synthesize: text → voice (compare by MOS · intelligibility)
Evaluate: voice → hard prompts (compare by WER · entities · cost)
Inspect: voice → spectrogram (compare by F0 · MFCC · centroid)
§ 01 · Speech-to-text

Word error rate, ranked.

The modern ranking metric is mean WER across the 8 HF Open ASR Leaderboard datasets — see the dedicated STT leaderboard. The catalogue below is sorted by each row's reported WER (benchmark in the column); LibriSpeech test-clean is now saturated near 1–2% and shown as the historical frontier.
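As a rough illustration of the metric itself, the sketch below computes word error rate as word-level edit distance divided by reference length, then averages per-dataset scores. It assumes whitespace tokenisation, no text normalisation, and an unweighted mean across datasets; the function names `word_error_rate` and `mean_wer` are hypothetical and are not taken from any leaderboard harness.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a free match
            )
        prev = curr
    return prev[-1] / len(ref)


def mean_wer(per_dataset_wer: dict[str, float]) -> float:
    """Unweighted mean of per-dataset WER values (an assumed aggregation)."""
    if not per_dataset_wer:
        raise ValueError("need at least one dataset score")
    return sum(per_dataset_wer.values()) / len(per_dataset_wer)
```

Real evaluations normally lowercase text, strip punctuation and normalise numbers before scoring, so raw numbers from this sketch will not match published figures.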


Metric: WER · lower is better
Models: 36 tracked · all shown
Dataset: Per row · see column
Full guide · speech recognition →
Tracked STT · May 2026
Shaded row marks current SOTA
# | Model | Vendor | Kind | Params | Benchmark | WER
01 | Cohere Transcribe (Mar 2026) | Cohere | Open Source | 2B | LibriSpeech test-clean | 1.3
02 | Higgs Audio v3 8B STT v2 | Boson AI | Open Source | 8B | LibriSpeech test-clean | 1.3
03 | Granite Speech 4.1 2B (NAR) | IBM | Open Source | 2B | LibriSpeech test-clean | 1.3
04 | Granite Speech 4.1 2B | IBM | Open Source | 2B | LibriSpeech test-clean | 1.3
05 | Parakeet TDT 1.1B | NVIDIA | Open Source | 1.1B | LibriSpeech test-clean | 1.4
06 | Granite 4.0 1B Speech | IBM | Open Source | 1B | LibriSpeech test-clean | 1.4
07 | Granite Speech 3.3 8B | IBM | Open Source | 8B | LibriSpeech test-clean | 1.4
08 | Parakeet RNNT 1.1B | NVIDIA | Open Source | 1.1B | LibriSpeech test-clean | 1.4
09 | Canary 1B | NVIDIA | Open Source | 1B | LibriSpeech test-clean | 1.5
10 | Canary 1B Flash | NVIDIA | Open Source | 1B | LibriSpeech test-clean | 1.5
11 | AssemblyAI Universal-3 Pro | AssemblyAI | Cloud API | - | LibriSpeech test-clean | 1.5
12 | Granite Speech 3.3 2B | IBM | Open Source | 2B | LibriSpeech test-clean | 1.5
13 | ElevenLabs Scribe v2 | ElevenLabs | Cloud API | - | LibriSpeech test-clean | 1.5
14 | Voxtral Small 24B | Mistral AI | Open Source | 24B | LibriSpeech test-clean | 1.6
15 | Canary-Qwen-2.5B | NVIDIA | Open Source | 2.5B | LibriSpeech test-clean | 1.6
16 | Zoom Scribe v1 | Zoom | Cloud API | - | LibriSpeech test-clean | 1.6
17 | Qwen3-ASR-1.7B | Alibaba | Open Source | 1.7B | LibriSpeech test-clean | 1.6
18 | Phi-4 Multimodal Instruct | Microsoft | Open Source | 6B | LibriSpeech test-clean | 1.7
19 | Parakeet TDT 0.6B v2 | NVIDIA | Open Source | 0.6B | LibriSpeech test-clean | 1.7
20 | Pulse Pro | Smallest AI | Cloud API | - | LibriSpeech test-clean | 1.8
21 | Voxtral Mini 3B | Mistral AI | Open Source | 3B | LibriSpeech test-clean | 1.9
22 | Conformer XL | Google | Research | 600M | LibriSpeech test-clean | 2.0
23 | Whisper Large v3 | OpenAI | Open Source | 1.55B | LibriSpeech test-clean | 2.0
24 | Google Chirp 3 | Google | Cloud API | - | LibriSpeech test-clean | 2.0
25 | Whisper Large v3 Turbo | OpenAI | Open Source | 809M | LibriSpeech test-clean | 2.1
26 | Deepgram Nova-3 | Deepgram | Cloud API | - | LibriSpeech test-clean | 2.2
27 | Voxtral Large | Mistral AI | Cloud API | - | LibriSpeech test-clean | 2.3
28 | Gladia v2 | Gladia | Cloud API | - | LibriSpeech test-clean | 2.5
29 | Speechmatics Flow | Speechmatics | Cloud API | - | LibriSpeech test-clean | 2.6
30 | Groq Whisper | Groq | Cloud API | 1.55B | LibriSpeech test-clean | 2.7
31 | Google USM | Google | Cloud API | 2B | LibriSpeech test-clean | 2.8
32 | Gemini 3 Pro (audio) | Google | Cloud API | - | AA-WER v2.0 | 2.9
33 | Azure Speech | Microsoft | Cloud API | - | LibriSpeech test-clean | 3.0
34 | Moonshine Base | Useful Sensors | Open Source | 61M | LibriSpeech test-clean | 3.5
35 | wav2vec 2.0 | Meta | Open Source | 317M | LibriSpeech test-clean | 3.8
36 | Gradium Speech-to-Text | Gradium | Cloud API | - | AA-WER v2 | 8.5
LibriSpeech test-clean frontier
Global best-so-far WER, not per-model provider history
[Chart: WER on the y-axis (1.0% to 4.1%, lower is better) against year on the x-axis (2020 to 2026).]
Fig 1 · WER or AA-WER as labelled in the benchmark column. The chart shows the global best-so-far trend for the LibriSpeech test-clean category, not a per-row provider history.
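A "global best-so-far" trend is a running minimum over results ordered by date. The sketch below shows one way to derive such a frontier; the `best_so_far` helper and the sample numbers are hypothetical and are not the catalogue's data or plotting code.

```python
def best_so_far(results, lower_is_better=True):
    """Return (year, best score seen up to that year) for each result, in date order.

    `results` is an iterable of (year, score) pairs in any order.
    """
    frontier = []
    best = None
    for year, score in sorted(results, key=lambda pair: pair[0]):
        improved = best is None or (score < best if lower_is_better else score > best)
        if improved:
            best = score
        frontier.append((year, best))
    return frontier


if __name__ == "__main__":
    # Illustrative values only; not taken from the table above.
    sample = [(2020, 4.0), (2023, 2.4), (2022, 3.0), (2024, 2.6), (2026, 1.5)]
    for year, wer in best_so_far(sample):
        print(year, wer)
```

A later, worse result (the 2024 entry in the sample) does not move the frontier, which is why the chart cannot be read as any single provider's history.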
§ 02 · Text-to-speech

TTS evidence, split.

Measured CodeSOTA runs rank on the dedicated TTS leaderboard. Reported MOS remains useful metadata, but it is not a universal SOTA ranking. Preference and controllability tracks stay unranked until CodeSOTA has its own artifacts.


Metric: Measured entity accuracy · reported MOS metadata
Models: 21 tracked · 2 measured
Evidence split: CodeSOTA runs · planned preference tracks · reported MOS registry
Measured TTS leaderboard →
Reported TTS registry →
Registry sample · May 2026
Measured rows are highlighted; reported MOS is not rank
CodeSOTA TTS evaluation tracks
Missing-model backlog →

These tracks define what CodeSOTA needs before it calls a TTS model SOTA. The information-fidelity track has measured rows today, and a first blind Elo study for naturalness preference is live but provisional. Realtime behavior, controllability, and long-form stability stay unranked until CodeSOTA runs the same prompts and publishes artifacts.

Track | Status | Metric | Scope | Evidence
Naturalness preference | live | blind pairwise win rate / Elo | same prompts, neutral A/B labels, matched voices where possible | codesota measured
    Note: First anonymous blind Elo study is live on /text-to-speech/elo; treat rankings as provisional until vote volume grows.
Realtime agents | planned | p50/p95 TTFA, interruption behavior, streaming chunk quality | voice-agent prompts with short turns and barge-in cases | codesota planned
    Note: Separate from naturalness because low-latency systems fail differently.
Information fidelity | live | WER, CER, critical entity accuracy, severe errors | hard text: numbers, URLs, dates, names, acronyms, product codes | codesota measured
    Note: Current measured rows live on the TTS measured leaderboard.
Controllability | planned | style/emotion/tag adherence plus acoustic movement | emotion, pace, pauses, whisper, emphasis, multi-speaker instructions | codesota planned
    Note: Provider control claims stay metadata until tested on shared prompts.
Long-form stability | planned | voice drift, omissions, repetitions, chapter-level artifact rate | 5-15 minute narration and dialogue prompts | codesota planned
    Note: Needed before podcast or audiobook recommendations are treated as benchmark-backed.
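For readers unfamiliar with how blind pairwise votes become a rating, here is a minimal sketch of a standard Elo update. The K-factor of 32, the 400-point scale and the function names are assumptions; the source does not state which Elo variant the CodeSOTA study uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one blind A/B vote in place: the winner gains what the loser gives up."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


if __name__ == "__main__":
    # Hypothetical voices starting from the same rating.
    ratings = {"voice_a": 1000.0, "voice_b": 1000.0}
    update_elo(ratings, winner="voice_a", loser="voice_b")
    print(ratings)
```

With few votes, a handful of results can swing the ratings by tens of points, which is one reason early rankings are best treated as provisional.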
Model | Vendor | Kind | Verification | Params | MOS | MOS note
ElevenLabs Turbo v2.5 | ElevenLabs | Cloud API | vendor reported | - | 4.8 | within MOS noise
Sesame CSM | Sesame | Open Source | community reported | 1B+ | 4.7 | within MOS noise
OpenAI TTS HD | OpenAI | Cloud API | vendor reported | - | 4.7 | within MOS noise
Gemini 2.5 Pro TTS | Google | Cloud API | vendor reported | - | 4.7 | within MOS noise
Cartesia Sonic 2 | Cartesia | Cloud API | vendor reported | - | 4.7 | within MOS noise
ElevenLabs Flash v2.5 | ElevenLabs | Cloud API | vendor reported | - | 4.6 | reported MOS; no CodeSOTA CI
PlayHT 3.0 | PlayHT | Cloud API | vendor reported | - | 4.6 | reported MOS; no CodeSOTA CI
Fish Audio S2 Pro | Fish Audio | Open Source | paper reported | 5B | 4.6 | reported MOS; no CodeSOTA CI
Orpheus TTS | Canopy Labs | Open Source | community reported | 3B | 4.6 | reported MOS; no CodeSOTA CI
Gemini 2.5 Flash TTS | Google | Cloud API | vendor reported | - | 4.5 | reported MOS; no CodeSOTA CI
Kokoro v1.0 | Hexgrad | Open Source | codesota measured | 82M | 4.5 | no CI yet; measured run exposes sample count and artifacts
XTTS v2 | Coqui | Open Source | paper reported | 467M | 4.5 | reported MOS; no CodeSOTA CI
Google Chirp 3 HD | Google | Cloud API | vendor reported | - | 4.4 | reported MOS; no CodeSOTA CI
Gradium TTS | Gradium | Cloud API | codesota measured | - | 4.4 | no CI yet; measured run exposes sample count and artifacts
Text-to-speech MOS frontier
Global best-so-far MOS, not per-model provider history
[Chart: MOS on the y-axis (3.4 to 5.0, higher is better) against year on the x-axis (2023 to 2026).]
Fig 2 · MOS is subjective. The chart shows the global best-so-far trend. Vendors use different listener panels and reference tracks, so differences smaller than 0.1 MOS should be treated as noise.
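One way to make "within MOS noise" concrete is to attach a confidence interval to each mean score and check whether two intervals overlap. The sketch below uses a normal approximation (z = 1.96) over per-listener ratings; the helper names and the ratings are hypothetical, and a real listening test would also need to account for listener and prompt effects.

```python
import math
import statistics


def mos_with_ci(ratings, z: float = 1.96):
    """Return (mean, lower, upper) for a list of 1-5 opinion scores.

    Uses a normal approximation to the 95% interval, which is an assumption
    that only holds reasonably for larger panels.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, mean - half_width, mean + half_width


def distinguishable(a, b) -> bool:
    """True when two (mean, lower, upper) intervals do not overlap."""
    return a[2] < b[1] or b[2] < a[1]


if __name__ == "__main__":
    # Hypothetical panels of eight listeners each.
    system_a = mos_with_ci([4, 5, 4, 5, 4, 5, 4, 5])
    system_b = mos_with_ci([5, 4, 5, 5, 4, 5, 4, 5])
    print(system_a, system_b, distinguishable(system_a, system_b))
```

Non-overlapping intervals are a conservative check, not a formal significance test; they are still enough to show why a 0.1 gap between small panels says little.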
Measured CodeSOTA run · Gradium

Gradium is tracked as a hosted TTS API with 4.4 MOS in the catalog and a separate intelligibility run: 13.4% normalized WER, 73.3% critical-entity accuracy, and 299 ms p95 first-byte latency on 30 hard English prompts.
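The two less familiar numbers in a run like this, p95 latency and critical-entity accuracy, can be sketched as follows. This assumes the synthesized audio has already been transcribed back to text, uses nearest-rank percentiles, and counts an entity as correct when it appears verbatim (case-insensitively) in the transcript. Those choices and the function names are assumptions, not the published CodeSOTA scoring method.

```python
import math


def percentile_nearest_rank(values, pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def critical_entity_accuracy(expected_entities, transcript: str) -> float:
    """Fraction of expected entity strings found verbatim in the transcript."""
    if not expected_entities:
        raise ValueError("need at least one expected entity")
    text = transcript.lower()
    hits = sum(1 for entity in expected_entities if entity.lower() in text)
    return hits / len(expected_entities)


if __name__ == "__main__":
    # Hypothetical first-byte latencies in milliseconds.
    latencies_ms = [120, 135, 150, 180, 210, 240, 260, 280, 295, 310]
    print(percentile_nearest_rank(latencies_ms, 95))
    print(critical_entity_accuracy(["example.com", "May 19"], "visit example.com on may 19"))
```

Substring matching is deliberately strict: "nineteen" for "19" counts as a miss here, so a real harness would normalise numbers and dates before comparing.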

Read the Gradium intelligibility run →
§ 03 · Comparison pages

Pairwise, and by use-case.

Long-form reads for the common decisions: which commercial TTS, which open-source, which model fits podcasts, audiobooks, voice bots or cloning.

Fig 3 · Each comparison page has its own evidence table; these are editorial reads, not benchmark duplicates.
§ 04 · Featured deep-dive

How speech becomes a picture.

Eleven open-source TTS voices, the same prompt, rendered through five DSP lenses and Griffin-Lim resynthesis. A reproducible walkthrough of the representations that vocoders, ASR systems and human ears actually read — mel spectrograms, MFCC, F0, formants.

Every figure is generated from the same code path; every voice is labelled with its provenance. No fabricated spectrograms, no stock audio. If the sample cannot be reproduced, it doesn't appear.
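To give a feel for one of these lenses, the sketch below estimates F0 for a single frame by picking the strongest autocorrelation peak inside a plausible pitch range. It is a pure-Python illustration under simple assumptions (one voiced frame, a 60 to 400 Hz search range, no voicing decision); it is not the deep-dive's code path, and `estimate_f0` is a hypothetical name.

```python
import math


def estimate_f0(samples, sample_rate: int, f_min: float = 60.0, f_max: float = 400.0) -> float:
    """Estimate the fundamental frequency of one frame via autocorrelation.

    Returns 0.0 when no positive correlation peak is found (for example, silence).
    """
    min_lag = max(1, int(sample_rate / f_max))
    max_lag = min(len(samples) - 1, int(sample_rate / f_min))
    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        corr = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else 0.0


if __name__ == "__main__":
    sample_rate = 8000
    # A synthetic 200 Hz tone standing in for a voiced frame (0.1 s).
    tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
    print(estimate_f0(tone, sample_rate))
```

Real voices need windowing, peak normalisation and octave-error handling, which is why production pitch trackers are considerably more involved than this.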

§ 05 · Benchmarks

The datasets we believe.

Canonical for each direction plus the community-adopted follow-ups. LibriSpeech, Common Voice and VCTK are canonicalised in our dataset registry; FLEURS, AudioBench and EARS are tracked qualitatively pending canonicalisation.

Rows with a mark live in the registry and carry full lineage.

Benchmark | Scope | Primary metric | Year | Source
LibriSpeech | Speech-to-Text | wer-test-clean | 2015 | link →
Common Voice | Speech-to-Text | wer | 2019 | link →
LJ Speech | Text-to-Speech | mos | 2017 | link →
VCTK | Text-to-Speech | mos | 2019 | link →
TTS Intelligibility | Text-to-Speech | critical-entity-accuracy | 2026 | link →
FLEURS | Speech-to-Text | WER (per-lang) | 2022 | link →
AudioBench | Audio-LLM | composite | 2024 | link →
EARS | Text-to-Speech | MOS · subjective | 2024 | link →
Fig 5 · Solid marker = canonicalised in the Codesota registry. Hollow marker = widely cited, tracked qualitatively, not yet graded.
ASR · LibriSpeech: 1.25 WER (↓), 2023 to 2026
TTS · naturalness: 4.8 MOS (↑), 2023 to 2026
Realtime TTS: ~90 ms TTFB (↓), 2023 to 2026
Open-source TTS: 4.7 MOS (↑), 2023 to 2026
Fig 6 · Directional trends across four speech axes. Dot marks the current SOTA entry from the catalogue.
§ 06 · How it works

Two pipelines, one register.

Modern speech recognition converts raw audio into mel-spectrogram features, runs them through a Conformer or Transformer encoder, and decodes with CTC, RNNT or attention. Post-processing (language-model rescoring, punctuation, diarisation) yields the final transcript.
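Of the three decoding options, CTC is the easiest to show in a few lines. The sketch below is a greedy CTC decoder: take the best token per frame, merge consecutive repeats, then drop blanks. The toy vocabulary, the blank index of 0 and the function name are assumptions for illustration; production systems usually add beam search and a language model.

```python
def ctc_greedy_decode(frame_scores, vocab, blank_id: int = 0) -> str:
    """Greedy CTC decoding over per-frame score vectors.

    `frame_scores` is a list of lists, one score per vocabulary entry per frame.
    """
    # Step 1: pick the highest-scoring token id in every frame.
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    decoded = []
    previous = None
    for token_id in best_ids:
        # Step 2: emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            decoded.append(vocab[token_id])
        previous = token_id
    return "".join(decoded)


if __name__ == "__main__":
    vocab = ["_", "c", "a", "t"]  # index 0 is the CTC blank
    frames = [
        [0.1, 0.8, 0.05, 0.05],  # c
        [0.1, 0.7, 0.1, 0.1],    # c (repeat, merged)
        [0.9, 0.0, 0.05, 0.05],  # blank
        [0.1, 0.1, 0.7, 0.1],    # a
        [0.1, 0.1, 0.1, 0.7],    # t
    ]
    print(ctc_greedy_decode(frames, vocab))
```

The blank symbol is what lets CTC emit genuine double letters: two identical tokens separated by a blank frame survive the merge step.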

Modern speech synthesis runs the pipeline in reverse. Text is embedded by a language model; acoustic tokens are predicted autoregressively or by flow matching; a vocoder or neural codec decodes those tokens back to waveform. The neural audio codec — EnCodec, SoundStream, Mimi — is the hinge that lets TTS borrow the tooling of LLMs.
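Codecs in this family typically turn each audio frame into a short stack of discrete tokens with residual vector quantization: each codebook quantizes what the previous one left over. The toy sketch below shows the encode and decode steps with tiny hand-written codebooks; real codecs learn their codebooks and operate on encoder outputs, and all names and values here are hypothetical.

```python
def nearest_index(codebook, vector) -> int:
    """Index of the codebook entry closest to `vector` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vector)),
    )


def rvq_encode(vector, codebooks):
    """Return one token index per codebook, each quantizing the remaining residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        k = nearest_index(codebook, residual)
        indices.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entries from every codebook to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


if __name__ == "__main__":
    codebooks = [
        [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],   # coarse stage
        [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine stage
    ]
    tokens = rvq_encode([1.2, 1.0], codebooks)
    print(tokens, rvq_decode(tokens, codebooks))
```

Because the output is a sequence of integer tokens, a language model can predict it the same way it predicts text, which is the hinge the paragraph above describes.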

What changed recently is the representation. Once audio could be tokenised, every architectural trick from text generation became available to speech: pretraining, instruction-tuning, prompted style control, zero-shot cloning. That is why the open-source gap in TTS closed so quickly after 2023.

On the STT side, the Conformer block — self-attention plus convolution — is still the workhorse. Whisper took a different path with a pure Transformer encoder-decoder trained on weak supervision at scale, trading some efficiency for massive multilingual coverage.

Related

Neighbouring registers.

Other modality hubs on Codesota worth reading next.

Guide · TTS models
Long-form overview of the TTS landscape.
Guide · speech recognition
How ASR models are built, trained, evaluated.
OCR · register
Document understanding and text extraction.
LLM · register
Frontier language-model benchmarks.