Codesota · Text-to-speech · Beta · 21 models tracked · MOS, TTFB, licence · Updated 2026-06-23

§ 00 · Text-to-speech models

Text-to-speech models, ranked by use case.

Direct answer: Kokoro is the small local default, XTTS v2 remains relevant for multilingual voice cloning, and hosted models win when realtime latency and product controls matter. The hard question is whether a synthetic voice can be directed: emotion, pace, timbre, pauses, voice quality, and long-form performance without losing intelligibility.

CodeSOTA splits TTS evidence into measured runs, a reported registry, and an unscored watchlist. This page now leads with controllability research: Zonos emotion-vector experiments, Kokoro proxy controls, a Gradium dramatic-reading sweep, and the audio behind the claims.

0.60

Zonos pitch-only CV accuracy

1.55

Timbre channel shift

0.688

Kokoro tempo/pause baseline

4.412

Gradium measured UTMOS

Read first

Emotion controllability

What channels carry emotion when text, speaker, and decoding are controlled.

Measured eval

UTMOS + WER round-trip

Independent Kokoro vs Gradium measurement on Harvard sentences.

Production baseline

Gradium sweep

Kent voice/config sweep against Caine's If, with line-level plots and audio.

Hard text

Information preservation

WER, CER, entity accuracy, and latency for production TTS.

§ 02 · Published research

What actually carries emotion in TTS?

A local research audit from the CodeSOTA lab: Zonos emotion-vector controls, Kokoro proxy controls, and a Gradium reading-style sweep against Michael Caine's performance of If. The thesis is measurable: synthetic emotion is not a vague "style" layer; it is carried by pitch variance, timbre, voice quality, energy, tempo, and pause structure.

Research source: /Users/kacper/Local/Ventures/lab/AUDIO. Zonos holds text, speaking rate, pitch_std, CFG, and speaker conditioning fixed while changing the 8D emotion vector. Kokoro is included as a useful negative control: it exposes speed and text recipes, not native emotion conditioning.

Channel	Effect	Interpretation
Timbre	1.55	MFCC mean/std and spectral centroid move even with text and speaker fixed.
Voice quality	1.09	Jitter and shimmer proxies shift as emotion vectors change phonation.
Pitch	1.00	F0 std and F0 range carry the clearest single-channel emotion signal.
Energy	0.90	RMS mean and variance rise on higher-arousal vectors.
Tempo	0.79	Rate shifts, but less cleanly than pitch and timbre in the Zonos run.
Pauses	0.70	Pause total and max pause separate some labels, especially low-energy speech.

Zonos mean absolute neutral-relative z by acoustic channel. Higher means the emotion vector changed that channel more.
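The channel metric in the table can be sketched as follows. This is a minimal illustration, not the lab's pipeline: the exact normalization is not stated on this page, so the sketch assumes each emotional clip is compared with the neutral clip of the same sentence and scaled by the feature's spread over all clips. The `CHANNELS` grouping, the `channel_shift` name, and the toy values are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical channel-to-feature grouping; the real feature list is longer.
CHANNELS = {"pitch": ["f0_std"], "tempo": ["rate"]}

def channel_shift(clips, channels=CHANNELS):
    """Mean absolute neutral-relative z per channel (assumed definition).

    clips: list of dicts with 'sentence', 'emotion', and one key per feature.
    Each emotional clip is compared with the neutral clip of the same
    sentence; the difference is scaled by the feature's spread over all clips.
    """
    neutral = {c["sentence"]: c for c in clips if c["emotion"] == "neutral"}
    scores = {}
    for channel, features in channels.items():
        zs = []
        for feat in features:
            scale = pstdev([c[feat] for c in clips]) or 1.0  # guard against zero spread
            for c in clips:
                if c["emotion"] != "neutral":
                    base = neutral[c["sentence"]][feat]
                    zs.append(abs(c[feat] - base) / scale)
        scores[channel] = mean(zs)
    return scores

# Toy data: pitch variance moves with the label, speaking rate does not.
clips = [
    {"sentence": "s1", "emotion": "neutral", "f0_std": 10, "rate": 4.0},
    {"sentence": "s1", "emotion": "happy", "f0_std": 20, "rate": 4.0},
    {"sentence": "s2", "emotion": "neutral", "f0_std": 12, "rate": 5.0},
    {"sentence": "s2", "emotion": "happy", "f0_std": 22, "rate": 5.0},
]
scores = channel_shift(clips)
print(scores)
```

Comparing against the same sentence in neutral keeps text difficulty out of the score, which is the point of the neutral-relative design.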

Same sentence, fixed speaker: Neutral · Happy · Sad · Angry · Fearful · Surprised · Low energy

Is synthetic emotion a real controllable variable?

Zonos gives a clean test because emotion is an explicit 8D vector. Text, speaker, CFG, speaking rate, and pitch_std can be held fixed while the emotion vector moves.

Which physical channels carry the effect?

The strongest evidence is not one feature. Pitch variance, MFCC timbre shifts, pause structure, RMS variance, and voice-quality proxies all move with emotion labels.

Can a model without emotion controls fake it?

Kokoro can simulate labels through speed and punctuation recipes, but the classifier mostly learns tempo and pause structure. That makes it a useful limitation baseline.

Can the measurement explain performance, not only short clips?

The Gradium/Caine sweep tests a long reading of If line by line, comparing emotional evidence, rank agreement, distribution similarity, and exact line matches.

Research design

Built to survive the obvious objections.

This is meant as a defensible experiment, not a gallery of nice clips: the controls, features, and scoring path are all visible.

Design choice	Implementation	Why it matters
Text control	5 neutral sentences for Zonos, 8 neutral sentences for Kokoro	Keeps semantic leakage low.
Emotion conditions	neutral, happy, sad, angry, fearful, surprised, low-energy	Separates valence, arousal, and performance style.
Fixed speaker	same voice/reference where the engine allows it	Reduces speaker-identity leakage.
Feature extraction	F0, RMS, rate, pauses, MFCC, centroid, jitter, shimmer	Turns performance into measurable acoustic channels.
Scoring	ANOVA, mutual information, feature importance, channel-only CV	Ranks what carries emotion instead of only asking if clips sound good.
Long-form test	Gradium voice/config sweep against Caine's If reading	Checks whether short-sentence findings survive a dramatic reading.
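The ANOVA row in the scoring path can be illustrated with a one-way F statistic computed per feature, with clips grouped by emotion condition. This is a sketch in plain Python with made-up values; the function name is hypothetical, and the lab's own tooling and p-value computation are not shown on this page.

```python
from statistics import mean

def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a single feature.

    groups: one list of feature values per emotion condition.
    F = (between-group mean square) / (within-group mean square).
    """
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy F0-std values for three conditions (illustrative numbers only).
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

A large F means the feature differs more between emotion conditions than within them, which is how a feature such as F0 std can be ranked against the others.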

Finding

Pitch is the cleanest single-channel classifier in Zonos.

F0 std leads ANOVA with F=9.01 and p=1.62e-05; F0 std and F0 range are also the top two mutual-information features.

Finding

Timbre is the largest overall movement, not a rounding error.

MFCC mean and std shifts rank second and third by ANOVA, while timbre has the highest mean absolute neutral-relative channel shift.

Finding

Pause structure is a real affect channel.

Pause total and pause max both appear in the top ANOVA features for Zonos, and they dominate the Kokoro proxy experiment.

Finding

Emotion controls can become mode changes.

The nonlinearity sweep is included because emotional intensity should be tested for breakpoints, not assumed to be a linear style slider.

Zonos emotion channels ranked by neutral-relative acoustic movement. — Slide evidence · Zonos channel effect
The emotion vector moves several physical channels at once.
This is the key plot. Each bar is a channel-level movement relative to the same sentence in the neutral voice. Timbre moves most, voice quality and pitch are close behind, and tempo/pause changes are still visible.
Timbre = 1.55 mean abs neutral-relative z
Read it as physical movement, not classifier accuracy. A large bar means the audio changed there; it does not automatically mean that channel is the best label decoder.
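The difference between movement and decodability is what a channel-only cross-validation measures. The sketch below assumes a leave-one-out nearest-centroid classifier restricted to one channel's features; the page does not say which classifier or fold scheme the lab used, so the function name, classifier choice, and toy values are all hypothetical.

```python
from statistics import mean

def loo_channel_accuracy(samples, channel_features):
    """Leave-one-out accuracy using only the listed features.

    samples: list of (feature_dict, label) pairs.
    Each held-out clip is assigned to the nearest class centroid computed
    from the remaining clips, in the space of channel_features only.
    """
    correct = 0
    for i, (feats, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        centroids = {}
        for lab in sorted({l for _, l in train}):
            rows = [f for f, l in train if l == lab]
            centroids[lab] = [mean([r[name] for r in rows])
                              for name in channel_features]
        point = [feats[name] for name in channel_features]

        def distance(lab):
            return sum((p - c) ** 2 for p, c in zip(point, centroids[lab]))

        correct += min(centroids, key=distance) == label
    return correct / len(samples)

# Toy data: pitch separates the labels, speaking rate does not.
samples = [
    ({"f0_std": 10, "rate": 4.0}, "neutral"),
    ({"f0_std": 11, "rate": 5.0}, "neutral"),
    ({"f0_std": 12, "rate": 4.5}, "neutral"),
    ({"f0_std": 20, "rate": 4.0}, "happy"),
    ({"f0_std": 21, "rate": 5.0}, "happy"),
    ({"f0_std": 22, "rate": 4.5}, "happy"),
]
pitch_acc = loo_channel_accuracy(samples, ["f0_std"])
tempo_acc = loo_channel_accuracy(samples, ["rate"])
print(pitch_acc, tempo_acc)
```

With more than one feature per channel, the features would need standardizing before distances are compared; the single-feature toy case sidesteps that.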

Zonos emotion vector matrix showing controlled emotion dimensions. — Slide evidence · experiment control
The input is an emotion-vector matrix, not vague prompt wording.
The experiment changes the explicit Zonos emotion vector while holding the rest of the generation setup fixed. That makes this closer to a causal probe than a prompt-vibes comparison.
7 conditions across an 8D emotion vector
Rows are intended conditions. Columns are emotion-vector dimensions. The result matrix starts after generation and feature extraction.

Feature extraction map for pitch, tempo, energy, pauses, timbre, and voice quality. — Slide evidence · feature extraction
The features map emotion into measurable audio physics.
This is the measurement instrument: F0, RMS, rate, pauses, spectral centroid, MFCC movement, jitter proxy, and shimmer proxy are the bridge between the audio and the research claim.
6 acoustic channels · 20+ derived features
This is why the page can say which channel carries emotion instead of merely saying which clip sounds emotional.
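Two of the listed channels, energy and pauses, are simple enough to sketch without an audio library. The example assumes mono samples as floats, non-overlapping frames, and a fixed silence threshold; the frame size, threshold, minimum pause length, and function names are illustrative assumptions rather than the lab's settings.

```python
import math

def frame_rms(samples, frame_len):
    """RMS energy per non-overlapping frame."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def pause_stats(rms, frame_seconds, threshold, min_pause_seconds=0.15):
    """Total and longest pause, where a pause is a run of quiet frames."""
    pauses, run = [], 0
    for value in rms + [threshold]:  # sentinel closes a trailing quiet run
        if value < threshold:
            run += 1
        else:
            if run * frame_seconds >= min_pause_seconds:
                pauses.append(run * frame_seconds)
            run = 0
    return {"pause_total": sum(pauses), "pause_max": max(pauses, default=0.0)}

# Toy signal at 100 Hz: speech, a 0.2 s gap, speech, a 0.1 s gap, speech.
signal = [0.5] * 30 + [0.0] * 20 + [0.5] * 20 + [0.0] * 10 + [0.5] * 20
rms = frame_rms(signal, frame_len=10)  # 0.1 s frames
stats = pause_stats(rms, frame_seconds=0.1, threshold=0.05)
print(stats)
```

The 0.1 s gap falls below the minimum pause length and is ignored, so only the 0.2 s gap contributes to pause total and pause max.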

Heatmap of physical acoustic dimensions by emotion condition. — Slide evidence · physical dimensions
Emotion is not just higher pitch.
The heatmap shows which acoustic dimensions rise or fall for each label. Fearful has broad movement across pauses, tempo, energy, timbre, and centroid. Happy is not simply higher pitch.
F0 std ANOVA p = 1.62e-05
Cell values are normalized changes relative to the neutral output of the same sentence, so text difficulty is largely controlled.

MFCC timbre fingerprint changes across Zonos emotion outputs. — Slide evidence · timbre fingerprint
Timbre changes are measurable, not decorative.
MFCC coefficients summarize the spectral envelope, so this view captures voice color more directly than a raw waveform or subjective listening note.
MFCC std shift ANOVA p = 2.97e-05
The MFCC heatmap becomes stable quantitative evidence: it can be averaged, tested, and fed into a classifier.

Zonos emotion intensity sweep showing nonlinear acoustic behavior. — Slide evidence · nonlinearity
Emotion intensity is not a straight slider.
The happy and angry sweeps do not behave like one monotonic knob. Some features jump early, some peak in the middle, and some move mostly near the endpoint.
Fixed seed across alpha values
This supports the nonlinearity hypothesis: emotion-vector changes alter model state, not one acoustic scalar.
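One way to test a sweep for breakpoints is to check each feature for monotonicity across the alpha values and locate its largest step. The sketch below is an assumed analysis, not the lab's code; the function names, alpha grid, and feature values are hypothetical.

```python
def sweep_shape(values, tolerance=0.0):
    """Label a feature's response to the intensity sweep."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    rising = all(d >= -tolerance for d in deltas)
    falling = all(d <= tolerance for d in deltas)
    return "monotonic" if rising or falling else "non-monotonic"

def largest_jump(alphas, values):
    """Return the alpha interval with the largest absolute change."""
    steps = [
        (abs(values[i + 1] - values[i]), (alphas[i], alphas[i + 1]))
        for i in range(len(values) - 1)
    ]
    return max(steps)[1]

alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
f0_std = [10.0, 10.5, 18.0, 17.0, 16.5]   # jumps early, then drifts back
rms_mean = [1.0, 2.0, 3.0, 4.0, 5.0]      # behaves like a plain slider

f0_shape = sweep_shape(f0_std)
rms_shape = sweep_shape(rms_mean)
f0_jump = largest_jump(alphas, f0_std)
print(f0_shape, rms_shape, f0_jump)
```

Keeping the seed fixed across alpha values, as the sweep does, matters here: otherwise sampling noise would show up as spurious non-monotonic steps.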

Michael Caine If reading emotion contrast heatmap. — Slide evidence · real speech target
Caine's reading is a trajectory, not one emotion label.
The real-speech target shows local rises in fearful, angry, sad, low-energy, happy, and surprised evidence. The task becomes trajectory matching across poem lines.
547 windows · 37 transcript segments
Contrast subtracts neutral/common speech activity, so the plot shows what increases locally, not just loudness.

Artifact diagnosis for clipped Zonos poem chunks. — Slide evidence · artifact audit
Bad generation hygiene can fake an emotion match.
The first fixed-speaker poem had hard stops caused by token caps, and those artifacts changed the score. It is reported here because the research has to be skeptical of its own outputs.
Clipped chunks around 6 seconds
A better-looking score is not useful if artifacts are driving the acoustic evidence.
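A simple hygiene check along these lines flags chunks that both reach the generation cap and end while the signal is still loud. This is a hypothetical heuristic: the page only states that clipped chunks sat around 6 seconds because of token caps, so the field names, thresholds, and function name below are assumptions.

```python
def flag_hard_stops(chunks, cap_seconds, tail_rms_threshold, slack_seconds=0.1):
    """Return ids of chunks that look cut off by a length cap.

    A chunk is flagged when its duration is at the cap (within slack) and
    its final frame is still loud, meaning speech did not decay naturally.
    """
    flagged = []
    for chunk in chunks:
        at_cap = chunk["duration"] >= cap_seconds - slack_seconds
        loud_tail = chunk["tail_rms"] > tail_rms_threshold
        if at_cap and loud_tail:
            flagged.append(chunk["id"])
    return flagged

# Toy chunk metadata (illustrative numbers only).
chunks = [
    {"id": "line_01", "duration": 3.2, "tail_rms": 0.01},
    {"id": "line_02", "duration": 5.98, "tail_rms": 0.21},  # cut mid-word
    {"id": "line_03", "duration": 6.0, "tail_rms": 0.005},  # long but ends quietly
]
flagged = flag_hard_stops(chunks, cap_seconds=6.0, tail_rms_threshold=0.05)
print(flagged)
```

Flagged chunks would be regenerated or excluded before scoring, so that truncation is not counted as pause or energy evidence.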

Zonos target run multidimensional line-level evidence for Kipling If. — Slide evidence · multidimensional performance
Dominant emotion is only the readable summary.
A poem line can have fearful, angry, low-energy, and pitch/energy evidence at the same time. The multidimensional line view is the real evidence layer.
Line-level vectors, not one-hot labels
This is why matching a human actor requires a trajectory of physical dimensions rather than one selected emotion per line.

§ 02B · Gradium baseline

Gradium is the production counterweight to the open-weight probes.

Zonos is the clean causal experiment because it exposes the emotion vector. Gradium belongs on this page for a different reason: it is the vendor-grade system we measured for naturalness, information preservation, latency, and long-form dramatic reading.

The best Gradium run uses the Kent voice with a slow-loose decoding setup. It scored 0.368 against the Caine target in the local acoustic-emotion space, while the separate intelligibility suite ranks Gradium first on hard prompts.

4.412

UTMOS on 50 Harvard sentences

Clean-sentence naturalness, measured by CodeSOTA.

13.4%

normalized WER

Best run on the 30-prompt hard intelligibility suite.

299 ms

p95 first-byte latency

Measured Gradium Audrey run, useful for voice agents.

0.368

Caine sweep score

Kent slow-loose best match against the dramatic-reading target.
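The normalized WER figure comes from a round trip: synthesize the prompt, transcribe the audio, and compare the transcript with the prompt. The comparison step can be sketched as a word-level edit distance. The normalization here (lowercasing and stripping punctuation) is an assumption, since the page does not list the suite's normalization rules, and the transcript is a made-up example.

```python
import re

def normalize(text):
    """Lowercase and drop punctuation before splitting into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

prompt = "The birch canoe slid on the smooth planks."
transcript = "the birch canoe slid on smooth planks"  # hypothetical ASR output
score = wer(prompt, transcript)
print(score)
```

Because the transcript comes from an ASR model, the score mixes TTS errors with recognizer errors; holding the recognizer fixed across systems keeps the comparison fair.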

Best line-level Gradium match against the target Caine emotion timeline. — Best Gradium line match
Line-level matching makes the evaluation inspectable instead of hiding the result behind a single score.

Axis	Value	Read
Voice	Kent	Best Gradium voice in the sweep.
Config	cfg 1.2 · padding 0.6 · temp 0.85	The winning decoding/control setup.
Line pacing	0.3s line · 1.0s stanza	Keeps the poem readable without hiding timing drift.
Rank agreement	0.480	How well line-level emotion ordering follows Caine's target.
Distribution similarity	0.406	Whether the whole poem occupies a similar emotion mix.
Target evidence	0.496	How much target-emotion evidence appears in generated lines.
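The page does not name the statistic behind rank agreement. One plausible reading is a Spearman rank correlation between per-line emotion evidence in the generated reading and in the target; the sketch below implements that under this assumption, with hypothetical function names and values.

```python
import math
from statistics import mean

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy per-line evidence for one emotion (illustrative numbers only).
target_lines = [0.1, 0.4, 0.9]
generated_lines = [0.2, 0.8, 0.5]
agreement = spearman(target_lines, generated_lines)
print(agreement)
```

A rank-based statistic rewards getting the ordering of lines right even when the absolute evidence levels differ between the actor and the synthetic voice.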

Gradium voice and configuration sweep leaderboard against the Michael Caine If reading. — Gradium sweep leaderboard
The leaderboard makes Gradium inspectable as a production baseline: voice choice, pacing, and decoding settings are compared against the same Caine target.

Gradium full If poem transcript colored by predicted emotion line by line. — Line-level transcript
The full transcript view shows the generated performance as a sequence of local emotion decisions, not a single global label.

Gradium synthetic clone full If poem transcript colored by predicted emotion. — Clone transcript stress test
The clone run is included as a stress test for whether voice identity and long-form emotional structure survive together.

Gradium performance sample

The Kent slow-loose run was the best sweep candidate: 0.368 weighted score, 0.481 rank agreement, and 0.496 target-emotion evidence. The scoring caveat is explicit: this is acoustic similarity to a local Zonos emotion space, not a human emotion recognizer.

Control surface

Which TTS models are experiment-ready?

Model	Controls	Role
Zonos	8D emotion vector, rate, pitch std, speaker embedding	Primary open-weight causal test
IndexTTS-2	8-float emotion vector, text-emotion mode, speaker prompt	Best cross-check candidate
Chatterbox	exaggeration, CFG, temperature, reference audio	Nonlinearity and stability stress test
OmniVoice	pitch, whisper, speed, duration, tags	Acoustic-channel isolation baseline
FastPitch / FastSpeech 2	explicit pitch, duration, energy	White-box mechanism baseline
Gradium	voice, CFG coefficient, padding bonus, temperature, pacing	Production performance and long-form reading sweep

What this adds to the leaderboard

MOS answers whether a voice sounds natural. This research asks whether a voice can be directed.

For voice agents, audiobooks, tutoring, games, and dubbing, the decisive question is whether the model obeys performance controls without destroying intelligibility, identity, or timing.

Limitations

Zonos labels are model-control labels, not direct human affect labels.
Kokoro labels are proxy recipes, so high accuracy can mean the classifier learned pacing.
Jitter and shimmer are frame-level proxies, not Praat-grade cycle measurements.
The Caine comparison scores acoustic similarity to a local emotion space, not acting quality.

Next benchmark

The next step is a public controllability suite.

Semantics vs direction

Positive, negative, and neutral sentences crossed with contradictory voice directions.

Emotion vs speaker identity

Measure whether strong emotion reduces speaker embedding similarity to a fixed reference.

Temporal emotion arcs

Ask a model to move from calm to fear to anger inside one utterance and score window-level smoothness.

Prosody hallucinations

Lower stability or raise style strength until tempo, F0, WER, or alignment breaks.

Vendor-neutral markup

Draft an Emotional Direction Markup layer and map it to Zonos, Gradium, ElevenLabs, and OmniVoice-style controls.
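The emotion versus speaker identity item could be scored with cosine similarity between speaker embeddings. The sketch assumes the embeddings come from some speaker-verification model, which is not specified here; the vectors, intensity levels, and function names are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identity_drift(reference, embeddings_by_intensity):
    """Map each emotion intensity to 1 - cosine similarity with the reference."""
    return {
        intensity: 1.0 - cosine_similarity(reference, emb)
        for intensity, emb in embeddings_by_intensity.items()
    }

# Hypothetical 3-D embeddings; real speaker embeddings are much larger.
reference = [1.0, 0.0, 0.0]
drift = identity_drift(reference, {
    0.0: [1.0, 0.0, 0.0],
    0.5: [0.9, 0.3, 0.0],
    1.0: [0.6, 0.8, 0.0],
})
print(drift)
```

Rising drift with intensity would indicate that stronger emotion is being bought at the cost of speaker identity.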

§ 03 · TTS registry

Registry rows, tiered.

Reported MOS stays visible, but it is not ranked as equivalent to CodeSOTA-measured runs. Sub-0.1 MOS gaps are marked as noise.

For rankable in-repo numbers see the measured leaderboard.

Model	Vendor	Kind	Verification	Architecture	Params	MOS	MOS note	Year
ElevenLabs Turbo v2.5	ElevenLabs	Cloud API	vendor reported	Proprietary (diffusion-based)	—	4.8	within MOS noise	2024
Sesame CSM	Sesame	Open Source	community reported	Conversational Speech Model	1B+	4.7	within MOS noise	2025
OpenAI TTS HD	OpenAI	Cloud API	vendor reported	Proprietary	—	4.7	within MOS noise	2023
Gemini 2.5 Pro TTS	Google	Cloud API	vendor reported	Multimodal LLM (native audio)	—	4.7	within MOS noise	2025
Cartesia Sonic 2	Cartesia	Cloud API	vendor reported	State-space model	—	4.7	within MOS noise	2025
ElevenLabs Flash v2.5	ElevenLabs	Cloud API	vendor reported	Proprietary (optimized)	—	4.6	reported MOS; no CodeSOTA CI	2025
PlayHT 3.0	PlayHT	Cloud API	vendor reported	Proprietary	—	4.6	reported MOS; no CodeSOTA CI	2025
Fish Audio S2 Pro	Fish Audio	Open Source	paper reported	Dual-autoregressive transformer + RVQ audio codec	5B	4.6	reported MOS; no CodeSOTA CI	2026
Orpheus TTS	Canopy Labs	Open Source	community reported	LLM-based (Llama backbone)	3B	4.6	reported MOS; no CodeSOTA CI	2025
Gemini 2.5 Flash TTS	Google	Cloud API	vendor reported	Multimodal LLM (native audio)	—	4.5	reported MOS; no CodeSOTA CI	2025
Kokoro v1.0	Hexgrad	Open Source	codesota measured	Lightweight autoregressive	82M	4.5	no CI yet; measured run exposes sample count and artifacts	2025
XTTS v2	Coqui	Open Source	paper reported	GPT-like + VITS decoder	467M	4.5	reported MOS; no CodeSOTA CI	2024
Google Chirp 3 HD	Google	Cloud API	vendor reported	Generative (USM-based)	—	4.4	reported MOS; no CodeSOTA CI	2025
Gradium TTS	Gradium	Cloud API	codesota measured	Proprietary neural TTS	—	4.4	no CI yet; measured run exposes sample count and artifacts	2026
Fish Speech 1.5	Fish Audio	Open Source	community reported	VQGAN + Transformer	500M	4.4	reported MOS; no CodeSOTA CI	2025
F5-TTS	Shanghai AI Lab	Open Source	paper reported	Flow-matching (non-autoregressive)	335M	4.4	reported MOS; no CodeSOTA CI	2024
Dia 1.6B	Nari Labs	Open Source	community reported	Transformer + non-verbal tokens	1.6B	4.3	reported MOS; no CodeSOTA CI	2025
Spark-TTS	SparkAudio	Open Source	community reported	Controllable Transformer	500M	4.3	reported MOS; no CodeSOTA CI	2025
Supertonic 3	Supertone	Open Source	community reported	ONNX Runtime local inference	99M	4.2	reported MOS; no CodeSOTA CI	2026
Parler-TTS	Hugging Face	Open Source	paper reported	Prompt-controlled Transformer	880M	4.1	reported MOS; no CodeSOTA CI	2025
Piper	Rhasspy	Open Source	community reported	VITS (lightweight)	~20M	3.6	reported MOS; no CodeSOTA CI	2023

Fig 1 · MOS on a 1-5 scale. Listener panels and reference audio differ across sources; gaps under 0.1 are noise.
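The noise rule can be written as a small comparison helper. In the table, rows 0.1 below the top score carry the "within MOS noise" note, so this sketch treats a gap of up to 0.1 (after rounding to one decimal) as noise; that inclusive reading and the function name are assumptions.

```python
def within_noise(mos_a, mos_b, noise_gap=0.1):
    """True when two one-decimal MOS values differ by no more than noise_gap.

    The gap is compared in whole tenths to avoid floating-point surprises
    such as 4.8 - 4.7 evaluating to slightly under 0.1.
    """
    gap_tenths = round(abs(mos_a - mos_b) * 10)
    return gap_tenths <= round(noise_gap * 10)

print(within_noise(4.8, 4.7))  # ElevenLabs Turbo v2.5 vs Sesame CSM
print(within_noise(4.8, 4.6))  # ElevenLabs Turbo v2.5 vs Orpheus TTS
```

Two models that fall within the noise gap should be ranked on other axes, such as latency or licence, rather than on MOS.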

§ 04 · Picks

By use-case.

MOS rankings are one axis. The model you actually want depends on what you're building — voice agents, audiobooks, on-device assistants each pull in different directions.

Real-time voice agents

Cartesia Sonic 2

Sub-200ms TTFB with streaming

<90ms TTFB, state-space architecture purpose-built for interactive voice. ElevenLabs Flash v2.5 is the fallback at ~120ms.

Maximum naturalness

ElevenLabs Turbo v2.5

Highest literature MOS

4.8 MOS — widely considered indistinguishable from human. Massive voice library and commercial cloning.

Open-source on-device

Kokoro v1.0

Apache / MIT license, <1B params, CPU-friendly

82M params, Apache-2.0, ~10x real-time on CPU. Tied with commercial APIs on CodeSOTA's independent UTMOS eval (4.48).

Expressive / dialogue

Sesame CSM

Handles emotion, laughter, pauses, multi-speaker

Conversational Speech Model — 4.7 MOS with emotional expressiveness. Dia 1.6B is the alternative for scripted dialogue with non-verbal cues.

Voice cloning (zero-shot)

F5-TTS / XTTS v2

Clone from seconds of reference audio

F5-TTS uses flow-matching for fast cloning; XTTS v2 covers 17 languages. Orpheus TTS is the LLM-based alternative with emotion tags.

Multilingual breadth

Google Chirp 3 HD

30+ languages supported

31 languages, 8 voice personas, instant cloning. Fish Speech 1.5 is the open-source alternative for CJK-heavy deployments.

§ 05 · Open vs cloud

The gap nearly closed.

As of 2026 the quality gap is nearly closed: the best open-source TTS (Sesame CSM, 4.7 MOS) sits 0.1 below the top commercial API (ElevenLabs Turbo v2.5, 4.8 MOS), and the next open models in the registry (4.5 to 4.6) are within 0.3. The remaining cloud advantages are in infrastructure, not naturalness.

When to go open source

Data residency or air-gapped deployment
High volume where per-character pricing hurts
Custom fine-tuning on your voice or domain
Edge / on-device inference (Kokoro, Piper)
Full reproducibility for research

When to go cloud API

Sub-200ms TTFB streaming (Cartesia, ElevenLabs Flash)
Professional voice cloning with licensing support
Broad multilingual coverage (Chirp 3, ElevenLabs)
Managed infrastructure, SLA, autoscaling
You don't want to host a GPU

§ 06 · How TTS is scored

MOS, and what it misses.

Human raters listen to generated audio and score naturalness 1 (bad) to 5 (indistinguishable from human). A modern TTS target is 4.5+. Reference recordings of real speech typically score 4.5–4.7 — the ceiling.

Because real MOS studies are slow and expensive, papers often use automatic MOS predictors like UTMOS, trained on crowdsourced ratings. UTMOS correlates around 0.9 with true MOS on TTS-like audio — good enough for ranking, noisy enough that small gaps (<0.1) should not be treated as decisive.

MOS is one axis. The model you actually pick is a function of latency, licence, voice library, and intelligibility on hard text. The best MOS can lose to a worse-MOS model that ships a better voice library and sub-200ms TTFB.

Neighbouring registers.

Independent TTS eval →

First-party UTMOS + WER measurement.

Speech hub · STT + TTS →

Combined register, 35+ models.

Speech-to-text →

The paired STT leaderboard.

Guide · TTS models →

Long-form walkthrough of the landscape.

Beta · MOS column is literature-sourced for most rows. See the independent eval for numbers measured in-repo. Feedback to k.wikiel@gmail.com.
