Measured by CodeSOTA · v1 · 2026-04-14

TTS Vendor Evaluation: Kokoro v1.0 vs Gradium TTS (default)

First-party evaluation of production TTS systems measured in this repo: same code, same test set, same metrics across every vendor. Quality is scored with UTMOS22-strong and intelligibility with a Whisper large-v3-turbo WER round-trip, on 50 Harvard sentences. Reproducible with scripts/tts_synth_*.py and scripts/tts_score.py.

Methodology

Test Set
First 50 Harvard Sentences (IEEE Rec. Pub. No. 297)
Public domain, speech-intelligibility standard. Phonetically balanced.
Sample Count
50 sentences per vendor
Quality Metric (MOS)
UTMOS22-strong (Sarulab, VoiceMOS Challenge 2022 winner)
Automatic MOS predictor trained specifically on TTS naturalness ratings. Loaded via torch.hub from tarepan/SpeechMOS. Scores are on a 1-5 scale; higher is better.
Intelligibility Metric (WER)
Whisper large-v3-turbo round-trip
Each synthesized clip is transcribed by Whisper, and WER is computed against the reference text with jiwer after lowercasing and punctuation stripping. Catches dropped words, hallucinations, and phoneme errors.
RTF (Real-Time Factor)
audio_s / synth_s, per-item mean
Gradium: HTTPS POST from EU region (network-bound). Kokoro: local torch CPU inference on Apple Silicon.
Audio Handling
Downsampled to 16 kHz mono before scoring
UTMOS and Whisper both operate natively at 16 kHz. Resampling is the intended input pipeline, not a lossy workaround.
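
The WER round-trip described above can be sketched in plain Python. This is a minimal stand-in, not the repo's scoring code: the evaluation itself uses jiwer, and the function names here (normalize, word_errors, corpus_wer) are hypothetical. It also assumes errors are pooled across all sentences before dividing by the total reference word count; a per-sentence average would give slightly different numbers.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Word-level edit distance: substitutions + deletions + insertions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references: list[str], hypotheses: list[str]) -> float:
    """Total word errors divided by total reference words (pooled, an assumption)."""
    errors = 0
    ref_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = normalize(ref), normalize(hyp)
        errors += word_errors(ref_tokens, hyp_tokens)
        ref_words += len(ref_tokens)
    return errors / ref_words
```

With this normalization, a transcript that differs from the reference only in casing or punctuation scores zero errors, while a dropped word counts as one deletion against the reference length.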

Quality & Intelligibility

Measured on 50 sentences. Lower WER is better, higher UTMOS is better.

| # | Model / Voice | UTMOS ↑ | Range | WER ↓ | Type |
|---|---|---|---|---|---|
| 1 | Kokoro v1.0 (Hexgrad, open source · voice: af_heart) | 4.48 | 4.43–4.52 | 1.63% | Open Source |
| 2 | Gradium TTS (default) (Gradium · voice: Audrey, flagship en) | 4.41 | 4.18–4.53 | 2.23% | Cloud API |

The UTMOS gap is tight (0.07), within typical prediction noise. Kokoro's narrower range (4.43–4.52) indicates more consistent sentence-to-sentence quality; Gradium's wider range (4.18–4.53) has higher peaks and deeper troughs.
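
As a rough illustration of how the UTMOS and Range columns could be derived from per-sentence scores, the sketch below reports the mean, minimum, and maximum. It assumes the Range column is the per-sentence min and max, which the page does not state explicitly. The summarize helper is hypothetical, and the sample values in the usage note are made up for illustration, not measured.

```python
from statistics import mean


def summarize(scores: list[float]) -> dict[str, float]:
    """Mean, min, max, and spread of per-sentence MOS predictions."""
    if not scores:
        raise ValueError("need at least one score")
    low, high = min(scores), max(scores)
    return {
        "mean": round(mean(scores), 2),
        "min": round(low, 2),
        "max": round(high, 2),
        "spread": round(high - low, 2),  # width of the reported range
    }
```

Calling summarize([4.0, 4.5, 5.0]) on these made-up scores returns a mean of 4.5 and a spread of 1.0; a smaller spread corresponds to the "more consistent" reading used above.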

Deployment & Features

Where the two systems differ most — the second axis of the comparison.

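| Model | RTF ↑ | TTFB | Sample Rate | Languages | Voices | Cloning | Streaming |
|---|---|---|---|---|---|---|---|
| Kokoro v1.0 (local / self-hosted) | 9.75× | not listed | 24 kHz | 8 | 54 | No | No |
| Gradium TTS (default) (cloud API · EU + US regions) | 2.71× | 200 ms | 48 kHz | 6 | 14 | Yes | Yes |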

Caveats

  • Easy test set. Harvard sentences are short, phonetically balanced, clean English. They do not stress-test numbers, abbreviations, named entities, code-switching, or multi-sentence paragraphs — where production TTS systems actually diverge.
  • One voice per vendor. Different voices from the same vendor can score materially differently. We picked the flagship English voice in each catalog.
  • Whisper-v3 as oracle. WER is computed from Whisper large-v3-turbo transcriptions scored against the reference text, so ASR mistakes count as TTS errors. Systematic Whisper differences (e.g. punctuation, casing, homophones) are normalized away, but the model can still favor acoustically typical TTS output.
  • No voice-cloning test yet. Voice cloning and Gradium's on-device model are advertised strengths this v1 does not exercise. Planned for v2.
  • RTF sample size differs. Gradium has fewer synth-time measurements (n=7) because sentences already cached from earlier runs were skipped; Kokoro was fully measured (n=40). Gradium RTF was cross-checked against the latency benchmark in data/gradium-tts-bench.json.
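
The per-item RTF mean from the methodology (audio_s / synth_s), together with the sample-size caveat above, can be sketched as follows. This is an illustrative helper, not the repo's code: mean_rtf is a hypothetical name, and representing a cached, untimed item as None is an assumption about how missing timings might be stored.

```python
from typing import Optional


def mean_rtf(
    audio_s: list[float],
    synth_s: list[Optional[float]],
) -> tuple[float, int]:
    """Per-item mean of audio_s / synth_s, plus the number of timed items.

    Items with no recorded synthesis time (None) are skipped, so the
    returned count can be smaller than the test-set size.
    """
    ratios = [
        audio / synth
        for audio, synth in zip(audio_s, synth_s)
        if synth is not None and synth > 0
    ]
    if not ratios:
        raise ValueError("no timed items to average")
    return sum(ratios) / len(ratios), len(ratios)
```

Returning the count alongside the mean makes it harder to compare two RTF figures without noticing that they rest on different numbers of measurements.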

Our Take

On this clean English intelligibility test, Kokoro v1.0 and Gradium are essentially tied on naturalness (4.48 vs 4.41 UTMOS — 0.07 is within typical UTMOS noise). Kokoro edges intelligibility (1.63% vs 2.23% WER) and produces more consistent sentence-to-sentence quality.

This is a notable result for open-source TTS: an 82M-parameter Apache-2.0 model matches a commercial cloud API on naturalness for clean English text. It is consistent with the broader 2024–2026 trend of small, focused open-source models closing the gap with paid APIs.

Gradium's case for production deployment lives in the second table. A commercial API buys ~200 ms TTFB streaming, 48 kHz output, 14 flagship voices across 6 languages, voice cloning, and an on-device model — features Kokoro does not offer. For voice-agent pipelines where time-to-first-audio, multilingual coverage, or cloning matter more than 0.07 MOS on clean English, Gradium's tradeoffs are reasonable.

Roadmap for v2: SEED-TTS hardcase prompts (numbers, abbreviations, long paragraphs), voice-cloning similarity (SECS via WavLM), multilingual eval, and ElevenLabs + Cartesia + OpenAI TTS rows. If you're a TTS vendor and want to be in v2, email k.wikiel@gmail.com.