Audio AI

The Complete
Speech AI Benchmark

Compare the best models for both Speech-to-Text (STT) and Text-to-Speech (TTS). From Whisper to ElevenLabs, see who leads the charts.

Benchmark Stats

  • Best WER (STT): 2.0%
  • Best MOS (TTS): 4.8
  • Models Compared: 20+

Speech-to-Text (STT)

Word Error Rate (WER)

WER measures the percentage of words incorrectly transcribed. It counts three types of errors:

  • Substitutions (S): wrong word, e.g. "the cat" becomes "the car"

  • Deletions (D): missing word, e.g. "the big cat" becomes "the cat"

  • Insertions (I): extra word, e.g. "the cat" becomes "the big cat"
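Together these give WER = (S + D + I) / N, where N is the number of words in the reference. A minimal sketch of that computation using word-level edit distance (a simplified version of what a library like jiwer does internally; the helper name is ours):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via Levenshtein distance over words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the quick brown fox", "the quik brown cat"))  # 0.5
```

Two substitutions out of four reference words gives 0.5, i.e. 50% WER.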

wer_example.py

# pip install jiwer
from jiwer import wer

reference = "the quick brown fox"
hypothesis = "the quik brown cat"  # two substitutions: "quik", "cat"

error_rate = wer(reference, hypothesis)
print("WER:", round(error_rate * 100, 1), "%")
# Output: WER: 50.0 %  (2 of 4 words wrong)

STT Leaderboard

WER on LibriSpeech test-clean. Lower is better.

Rank  Model             Org        WER (%)  Type         Year
#1    Conformer XL      Google     2.0      Research     2021
#2    Whisper Large v3  OpenAI     2.7      Open Source  2024
#3    Google USM        Google     2.8      Cloud API    2023
#4    Azure Speech      Microsoft  3.0      Cloud API    2024
#5    Whisper Medium    OpenAI     3.4      Open Source  2023
#6    wav2vec 2.0       Meta       3.8      Open Source  2020

STT Datasets

LibriSpeech (2015): 1,000 hours of English speech from audiobooks. The standard benchmark for automatic speech recognition.

Common Voice (2019): Massive multilingual dataset of transcribed speech. Covers diverse demographics and accents.

Text-to-Speech (TTS)

Mean Opinion Score (MOS)

TTS is harder to evaluate objectively than STT. The gold standard is MOS: human raters listen to generated audio and rate it from 1 (Bad) to 5 (Excellent).

  • 5: Excellent (human-like, natural intonation)
  • 4: Good (intelligible, minor robotic artifacts)
  • 3: Fair (understandable but clearly synthetic)
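In practice, MOS is the mean of many listener ratings and is usually reported with a confidence interval. A sketch with made-up ratings (the normal-approximation 95% interval is an assumption; papers vary in how they compute it):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical ratings (1-5) from 8 listeners for one audio sample
ratings = [5, 4, 5, 4, 4, 5, 3, 4]

mos = mean(ratings)
# 95% confidence interval under a normal approximation
ci95 = 1.96 * stdev(ratings) / sqrt(len(ratings))

print(f"MOS: {mos:.2f} +/- {ci95:.2f}")  # MOS: 4.25 +/- 0.49
```

Real evaluations aggregate hundreds of ratings across many utterances, which is why published MOS numbers should be compared only within the same study.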

Other TTS Metrics

  • MCD (Mel Cepstral Distortion)

    Objective distance between generated and reference audio. Lower is better.

  • Latency (Time-to-First-Byte)

    Critical for voice bots. Best models achieve < 200ms.

  • Word Accuracy

    Does it skip words or hallucinate? Checked via STT on output.
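MCD, for example, is a weighted Euclidean distance between mel-cepstral coefficient vectors, reported in dB. A sketch of the per-frame formula (the full metric also requires time-aligning the two utterances, e.g. with DTW, which is omitted here; the coefficient vectors are made up):

```python
from math import log, sqrt

def mcd_frame(ref_mcc, gen_mcc):
    """Mel Cepstral Distortion (dB) for one aligned frame pair:
    MCD = (10 / ln 10) * sqrt(2 * sum_d (mc_d - mc_d')^2)
    The 0th coefficient (overall energy) is conventionally excluded."""
    diff_sq = sum((r - g) ** 2 for r, g in zip(ref_mcc[1:], gen_mcc[1:]))
    return (10.0 / log(10)) * sqrt(2.0 * diff_sq)

# Illustrative coefficient vectors (real MCCs have ~13-25 dimensions)
ref = [1.0, 0.5, -0.2, 0.1]
gen = [1.0, 0.4, -0.1, 0.3]
print(round(mcd_frame(ref, gen), 2))  # 1.5
```

Full-utterance MCD averages this distance over all aligned frames; values below roughly 5 dB are typically considered good.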

TTS Leaderboard

Approximate MOS ratings based on community benchmarks and paper results. Higher is better.

Rank  Model                  Org         MOS (1-5)  Type         Year
#1    ElevenLabs Turbo v2.5  ElevenLabs  4.8        Cloud API    2024
#2    OpenAI TTS HD          OpenAI      4.7        Cloud API    2023
#3    XTTS v2                Coqui       4.5        Open Source  2024
#4    MMS-TTS                Meta        4.0        Open Source  2023
#5    Bark                   Suno        3.9        Open Source  2023
#6    Piper                  Rhasspy     3.6        Open Source  2023

TTS Datasets

LJ Speech (2017): 13,100 short audio clips of a single speaker reading passages from non-fiction books. The standard benchmark for single-speaker TTS.

VCTK (2019): Speech data from 110 English speakers with various accents. Used for multi-speaker TTS.

Summary: Which Model Should You Use?

Speech-to-Text

Best Overall & Local
Whisper Large v3 (OpenAI) - Free, accurate, runs on consumer GPU.
Best for Streaming
Deepgram / Azure Speech - Extremely low latency for real-time apps.

Text-to-Speech

Best Quality
ElevenLabs - Indistinguishable from human speech, emotive.
Best Open Source
XTTS v2 (Coqui) - Voice cloning and high quality, runs locally.