Codesota · Speech-to-text · Beta · 19 models ranked · mean WER · HF Open ASR Leaderboard · Updated 2026-06-05
§ 00 · Speech-to-text

Word error rate, ranked.

A dedicated STT register — decoupled from TTS. The HF Open ASR Leaderboard, use-case picks, and a landscape where open models lead the public leaderboard outright.

19 models ranked by mean Word Error Rate across the eight datasets of the HF Open ASR Leaderboard — AMI, Earnings-22, GigaSpeech, LibriSpeech clean/other, SPGISpeech, TED-LIUM and VoxPopuli. Lower is better. LibriSpeech test-clean alone is saturated near 1–2%; mean WER over noisy, meeting and telephony audio is what now separates the field.

§ 01 · Open ASR Leaderboard

The leaderboard, top to tail.

Ranked by mean WER across the eight datasets of the HF Open ASR Leaderboard. The top row (rank 01) is the current SOTA.


Figures from the HF Open ASR Leaderboard (en_shortform), accessed 2026-05-22. Lower is better.

| # | Model | Vendor | Kind | Architecture | Params | Mean WER (%) | Year |
|---|---|---|---|---|---|---|---|
| 01 | Granite Speech 4.1 2B | IBM | Open Source | Speech-aware LLM (Granite) | 2B | 5.33 | 2025 |
| 02 | Cohere Transcribe (Mar 2026) | Cohere | Open Source | Transformer ASR | 2B | 5.42 | 2026 |
| 03 | Pulse Pro | Smallest AI | Cloud API | Proprietary ASR | | 5.42 | 2026 |
| 04 | Zoom Scribe v1 | Zoom | Cloud API | Proprietary | | 5.47 | 2025 |
| 05 | Granite 4.0 1B Speech | IBM | Open Source | Speech-aware LLM (Granite) | 1B | 5.52 | 2025 |
| 06 | Canary-Qwen-2.5B | NVIDIA | Open Source | FastConformer encoder + Qwen2 LM decoder | 2.5B | 5.63 | 2025 |
| 07 | Granite Speech 3.3 8B | IBM | Open Source | Speech-aware LLM (Granite) | 8B | 5.74 | 2025 |
| 08 | Qwen3-ASR-1.7B | Alibaba | Open Source | Qwen3 backbone fine-tuned for ASR | 1.7B | 5.76 | 2025 |
| 09 | ElevenLabs Scribe v2 | ElevenLabs | Cloud API | Proprietary | | 5.83 | 2025 |
| 10 | Phi-4 Multimodal Instruct | Microsoft | Open Source | Phi-4 multimodal | 6B | 6.02 | 2025 |
| 11 | Parakeet TDT 0.6B v2 | NVIDIA | Open Source | FastConformer (TDT) | 0.6B | 6.05 | 2025 |
| 12 | AssemblyAI Universal-3 Pro | AssemblyAI | Cloud API | Proprietary Conformer-based | | 6.21 | 2025 |
| 13 | Canary 1B | NVIDIA | Open Source | FastConformer + multi-task | 1B | 6.50 | 2024 |
| 14 | Voxtral Small 24B | Mistral AI | Open Source | Large multimodal LM with audio encoder | 24B | 6.62 | 2025 |
| 15 | Google Chirp 3 | Google | Cloud API | Generative (USM-based) | | 6.63 | 2025 |
| 16 | Parakeet TDT 1.1B | NVIDIA | Open Source | FastConformer (TDT) | 1.1B | 7.02 | 2024 |
| 17 | Voxtral Mini 3B | Mistral AI | Open Source | Audio-Language Model (Transformer) | 3B | 7.05 | 2025 |
| 18 | Whisper Large v3 | OpenAI | Open Source | Transformer Encoder-Decoder | 1.55B | 7.44 | 2023 |
| 19 | Whisper Large v3 Turbo | OpenAI | Open Source | Transformer Encoder-Decoder (pruned decoder) | 809M | 7.83 | 2024 |
Fig 1 · Mean WER (%) across the 8 HF Open ASR Leaderboard datasets. Rank 01 is the current SOTA.
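
One way to read the Mean WER column, sketched in Python. This assumes the mean is an unweighted average of the eight per-dataset WERs, which this page does not spell out. `mean_wer` is a hypothetical helper, and the per-dataset numbers are placeholders chosen for illustration, not leaderboard figures.

```python
from statistics import fmean

# The eight datasets named on this page.
DATASETS = [
    "AMI", "Earnings-22", "GigaSpeech", "LibriSpeech clean",
    "LibriSpeech other", "SPGISpeech", "TED-LIUM", "VoxPopuli",
]


def mean_wer(per_dataset: dict[str, float]) -> float:
    """Unweighted mean of per-dataset WER (%). Assumed aggregation, see lead-in."""
    missing = [name for name in DATASETS if name not in per_dataset]
    if missing:
        raise ValueError(f"missing datasets: {missing}")
    return fmean(per_dataset[name] for name in DATASETS)


# Placeholder scores (NOT real results): low WER on clean read speech,
# much higher WER on meeting and earnings-call audio.
placeholder_scores = {
    "AMI": 10.0, "Earnings-22": 10.0, "GigaSpeech": 9.0,
    "LibriSpeech clean": 1.5, "LibriSpeech other": 3.0,
    "SPGISpeech": 2.5, "TED-LIUM": 3.0, "VoxPopuli": 6.0,
}

print(mean_wer(placeholder_scores))  # 5.625
```

Under this reading, a model with a near-saturated LibriSpeech clean score can still land well above 5% on the mean, because every dataset carries equal weight regardless of its size.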
§ 02 · Picks

By use-case.

Lowest WER isn't always what you want. Streaming latency, language coverage, and hardware constraints often matter more than a fraction of a percentage point.

Maximum accuracy

Granite Speech 4.1 2B

Lowest mean WER on the HF Open ASR Leaderboard

5.33% mean WER across 8 datasets — #1 on the Open ASR Leaderboard. A 2B open model that holds up on noisy, meeting and telephony audio, not just clean read speech.

Real-time streaming

Deepgram Nova-3

Sub-300ms latency with partial results

Purpose-built for streaming. Nova-3 maintains strong WER while delivering partial hypotheses in real time. Gladia and Speechmatics Flow are alternatives.

Multilingual (100+ languages)

Whisper Large v3 Turbo

Broad language coverage with consistent quality

100+ languages in a single model, with a quoted 2.5% WER on English (against 7.83% mean WER on the leaderboard above). Voxtral Large is the newer alternative with audio Q&A capabilities.

Audio understanding (not just transcription)

Voxtral Large

Audio Q&A, translation, spoken instructions

Mistral's audio-language model. ~2.3% WER plus multimodal LLM capabilities Whisper lacks: audio Q&A, translation, spoken instruction following.

Edge / on-device

Whisper Small / Moonshine

Runs on CPU, mobile, or Raspberry Pi

Whisper Small is the pragmatic CPU choice. Moonshine from Useful Sensors is 5x faster than Whisper Tiny, with better accuracy on-device.

Fastest inference

Groq Whisper

LPU-accelerated or specialized hardware

Groq's LPU delivers Whisper inference ~150x faster than real time. Best choice when you need to batch-process large audio corpora quickly.
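
To make "~150x faster than real time" concrete, here is a back-of-the-envelope sketch. It treats the figure as a pure throughput factor (audio duration divided by processing time) on a single stream and ignores upload time, queuing and rate limits. Both function names are hypothetical, and the 1,500-hour corpus is an illustrative size, not a figure from this page.

```python
def processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time to transcribe audio at `speedup` times real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def speedup_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Inverse view: how many times faster than real time a run was."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# One hour of audio at 150x real time.
print(processing_seconds(3600, 150))  # 24.0 (seconds)

# Hypothetical 1,500-hour corpus at the same rate, reported in hours.
print(processing_seconds(1500 * 3600, 150) / 3600)  # 10.0 (hours)
```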

§ 03 · Open vs cloud

The picture inverted.

In 2026 open-source STT has flipped the historical picture: Granite Speech 4.1 2B (5.33% mean WER) tops the Open ASR Leaderboard, ahead of every proprietary system measured on it. Cloud APIs still win on streaming infrastructure, multilingual scale, and managed hosting.

When to go open source
  • Lowest-WER requirement (Granite Speech, Canary-Qwen)
  • Data residency and compliance
  • Offline or edge deployment (Moonshine, Whisper Small)
  • High-volume batch processing (Whisper on Groq)
  • Research and reproducibility
When to go cloud API
  • Real-time streaming with partials (Deepgram Nova-3)
  • Broad multilingual coverage out of the box (AssemblyAI)
  • Managed infra, SLA, autoscaling
  • Built-in diarisation and speaker labelling
  • Fast vendor iteration on new languages and domains
§ 04 · How STT is scored

WER, and what it misses.

Word Error Rate is the number of substituted, deleted and inserted words divided by the number of words in the reference transcript: WER = (S + D + I) / N, where S, D and I count substitutions, deletions and insertions and N is the reference length. Lower is better. Human transcribers sit at 2–4% on clean read speech, over 10% on noisy conversational audio, and over 20% on heavy accents or low-resource languages.
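
The formula can be sketched with a word-level Levenshtein alignment. This is a minimal illustration that splits on whitespace and applies no text normalisation. The function names `wer_counts` and `wer` are hypothetical, and it is not the leaderboard's scoring code.

```python
def wer_counts(reference: str, hypothesis: str) -> tuple[int, int, int, int]:
    """Return (S, D, I, N) from a minimum-edit word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (total_edits, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)   # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)   # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]   # match costs nothing
                continue
            t, s, d, ins = dp[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, ins)
            t, s, d, ins = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, ins)
            t, s, d, ins = dp[i][j - 1]
            insertion = (t + 1, s, d, ins + 1)
            dp[i][j] = min(substitution, deletion, insertion)
    _, s, d, ins = dp[-1][-1]
    return s, d, ins, len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N as a fraction, not a percentage."""
    s, d, ins, n = wer_counts(reference, hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return (s + d + ins) / n


ref = "the cat sat on the mat"
hyp = "the cat sit on mat"
print(wer_counts(ref, hyp))      # (1, 1, 0, 6)
print(round(wer(ref, hyp), 3))   # 0.333
```

Because insertions are counted against a fixed reference length, WER is not capped at 100%: a one-word reference transcribed as four words scores 3.0 here.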

WER is a compressed signal. Most reports normalise away punctuation, casing and numerics; batch-mode WER says nothing about streaming latency; WER under-reports hallucination on silence. For anything past picking the leader, you want domain-specific evaluation.
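
How much normalisation hides can be seen in a small sketch. The `normalise` function below is a hypothetical, deliberately simple normaliser (lowercase, strip punctuation); it does not handle numerics such as "20" versus "twenty", and it is not the normaliser any particular leaderboard uses.

```python
import re


def normalise(text: str) -> str:
    """Hypothetical normaliser: lowercase, drop punctuation, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)   # keep letters, digits, apostrophes
    return " ".join(text.split())


def word_errors(ref_words: list[str], hyp_words: list[str]) -> int:
    """Word-level Levenshtein distance (S + D + I), one row at a time."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        cur = [i]
        for j, h in enumerate(hyp_words, start=1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),   # match or substitution
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return word_errors(ref_words, hypothesis.split()) / len(ref_words)


reference = "Hello, world. Revenue grew this quarter."
hypothesis = "hello world revenue grew this quarter"

print(round(wer(reference, hypothesis), 3))               # 0.667
print(wer(normalise(reference), normalise(hypothesis)))   # 0.0
```

The same transcript scores 66.7% or 0% depending only on whether casing and punctuation are compared, which is why WER figures are comparable only under the same normalisation.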

Related

Neighbouring registers.

  • Text-to-speech: The paired TTS leaderboard and picks-by-use-case.
  • Speech hub · STT + TTS: Combined register with papers, repos, trends.
  • Audio-to-text building block: Integration guide covering API shapes, streaming, diarisation.
  • Audio benchmarks: Classification, music generation, audio understanding.
Beta · Feedback to k.wikiel@gmail.com.