CodeSOTA · Text-to-speech · Beta · 21 models tracked · MOS, TTFB, licence · Updated 2026-06-05
§ 00 · Text-to-speech models

Text-to-speech models, ranked by use case.

Direct answer: Kokoro is the small local default, XTTS v2 remains relevant for multilingual voice cloning, and hosted models win when realtime latency and product controls matter. The hard question is whether a synthetic voice can be directed: emotion, pace, timbre, pauses, voice quality, and long-form performance without losing intelligibility.

CodeSOTA splits TTS evidence into measured runs, a reported registry, and an unscored watchlist. This page now leads with controllability research: Zonos emotion-vector experiments, Kokoro proxy controls, a Gradium dramatic-reading sweep, and the audio behind the claims.

0.60 · Zonos pitch-only CV accuracy
1.55 · Timbre channel shift
0.688 · Kokoro tempo/pause baseline
4.412 · Gradium measured UTMOS
Read first

Emotion controllability

What channels carry emotion when text, speaker, and decoding are controlled.

Measured eval

UTMOS + WER round-trip

Independent Kokoro vs Gradium measurement on Harvard sentences.

Production baseline

Gradium sweep

Kent voice/config sweep against Caine's If, with line-level plots and audio.

Hard text

Information preservation

WER, CER, entity accuracy, and latency for production TTS.

§ 02 · Published research

What actually carries emotion in TTS?

A local research audit from the CodeSOTA lab: Zonos emotion-vector controls, Kokoro proxy controls, and a Gradium reading-style sweep against Michael Caine's performance of If. The thesis is measurable: synthetic emotion is not a vague "style" layer; it is carried by pitch variance, timbre, voice quality, energy, tempo, and pause structure.

Research source: /Users/kacper/Local/Ventures/lab/AUDIO. Zonos holds text, speaking rate, pitch_std, CFG, and speaker conditioning fixed while changing the 8D emotion vector. Kokoro is included as a useful negative control: it exposes speed and text recipes, not native emotion conditioning.

Channel | Effect | Interpretation
Timbre | 1.55 | MFCC mean/std and spectral centroid move even with text and speaker fixed.
Voice quality | 1.09 | Jitter and shimmer proxies shift as emotion vectors change phonation.
Pitch | 1.00 | F0 std and F0 range carry the clearest single-channel emotion signal.
Energy | 0.90 | RMS mean and variance rise on higher-arousal vectors.
Tempo | 0.79 | Rate shifts, but less cleanly than pitch and timbre in the Zonos run.
Pauses | 0.70 | Pause total and max pause separate some labels, especially low-energy speech.
Zonos mean absolute neutral-relative z by acoustic channel. Higher means the emotion vector changed that channel more.
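One way to read "mean absolute neutral-relative z" is sketched below. The feature names, the channel grouping, and the choice to z-score each feature over all clips are assumptions made for illustration; the page does not publish the exact normalization.

```python
from statistics import mean, pstdev

# Hypothetical feature-to-channel map; the real feature list is larger.
CHANNELS = {"pitch": ["f0_std", "f0_range"], "energy": ["rms_mean"]}


def channel_shift(clips, channels=CHANNELS):
    """clips: dicts with 'sentence', 'emotion', and numeric feature values.

    Returns the mean absolute z-score difference between each non-neutral
    clip and the neutral clip of the same sentence, averaged per channel.
    """
    feats = [f for names in channels.values() for f in names]
    # Scale per feature over all clips so channels share one unit.
    # A constant feature gets scale 1.0 to avoid dividing by zero.
    scale = {f: pstdev([c[f] for c in clips]) or 1.0 for f in feats}
    neutral = {c["sentence"]: c for c in clips if c["emotion"] == "neutral"}

    out = {}
    for channel, names in channels.items():
        deltas = [
            abs(c[f] - neutral[c["sentence"]][f]) / scale[f]
            for c in clips if c["emotion"] != "neutral"
            for f in names
        ]
        out[channel] = mean(deltas)
    return out
```

Because each clip is compared with the neutral rendering of the same sentence, differences in text difficulty cancel out before the channel average is taken.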
Same sentence, fixed speaker: Neutral, Happy, Sad, Angry, Fearful, Surprised, Low energy.
01

Is synthetic emotion a real controllable variable?

Zonos gives a clean test because emotion is an explicit 8D vector. Text, speaker, CFG, speaking rate, and pitch_std can be held fixed while the emotion vector moves.

02

Which physical channels carry the effect?

The strongest evidence is not one feature. Pitch variance, MFCC timbre shifts, pause structure, RMS variance, and voice-quality proxies all move with emotion labels.

03

Can a model without emotion controls fake it?

Kokoro can simulate labels through speed and punctuation recipes, but the classifier mostly learns tempo and pause structure. That makes it a useful limitation baseline.

04

Can the measurement explain performance, not only short clips?

The Gradium/Caine sweep tests a long reading of If line by line, comparing emotional evidence, rank agreement, distribution similarity, and exact line matches.
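Question 03 rests on the idea of a channel-only classifier: train on one channel's features and see how well labels can be recovered. The sketch below uses a leave-one-out nearest-centroid classifier, which is an assumption; the page does not name the classifier or the cross-validation scheme. Feature names and values are hypothetical.

```python
from statistics import mean


def loo_accuracy(rows, features):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    rows: list of (label, feature_dict); features: keys for one channel.
    Features are used unscaled, so mix only comparable units.
    """
    correct = 0
    for i, (label, x) in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        centroids = {}
        for lab in {l for l, _ in train}:
            members = [f for l, f in train if l == lab]
            centroids[lab] = [mean(m[k] for m in members) for k in features]
        point = [x[k] for k in features]
        guess = min(
            centroids,
            key=lambda lab: sum((p - c) ** 2 for p, c in zip(point, centroids[lab])),
        )
        correct += guess == label
    return correct / len(rows)


# Hypothetical proxy-recipe clips: labels differ in rate, not in pitch.
rows = [
    ("fast", {"rate": 5.0, "f0_std": 20.0}),
    ("fast", {"rate": 5.2, "f0_std": 30.0}),
    ("slow", {"rate": 3.0, "f0_std": 20.0}),
    ("slow", {"rate": 3.1, "f0_std": 30.0}),
]
tempo_only = loo_accuracy(rows, ["rate"])
pitch_only = loo_accuracy(rows, ["f0_std"])
```

When a tempo-only classifier scores high and a pitch-only classifier does not, high overall accuracy says the labels are decodable from pacing, not that the model expresses emotion.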

Research design

Built to survive the obvious objections.

This is meant to be a defensible experiment, not a gallery of nice clips: the controls, features, and scoring path are all visible.

Design choice | Implementation | Why it matters
Text control | 5 neutral sentences for Zonos, 8 neutral sentences for Kokoro | Keeps semantic leakage low.
Emotion conditions | neutral, happy, sad, angry, fearful, surprised, low-energy | Separates valence, arousal, and performance style.
Fixed speaker | same voice/reference where the engine allows it | Reduces speaker-identity leakage.
Feature extraction | F0, RMS, rate, pauses, MFCC, centroid, jitter, shimmer | Turns performance into measurable acoustic channels.
Scoring | ANOVA, mutual information, feature importance, channel-only CV | Ranks what carries emotion instead of only asking if clips sound good.
Long-form test | Gradium voice/config sweep against Caine's If reading | Checks whether short-sentence findings survive a dramatic reading.
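Two of the features in the design table, RMS energy and pause structure, are simple enough to sketch without an audio library. The frame length, silence threshold, and minimum pause duration below are illustrative assumptions, not the values used in the lab.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy for consecutive non-overlapping frames of a mono signal."""
    frames = [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    return [math.sqrt(sum(x * x for x in f) / frame_len) for f in frames]


def pause_stats(rms, frame_sec, threshold=0.01, min_pause_sec=0.15):
    """Total and longest pause; a pause is a long enough run of quiet frames."""
    pauses, run = [], 0
    # The infinite sentinel closes a quiet run that reaches the end of the clip.
    for value in rms + [float("inf")]:
        if value < threshold:
            run += 1
        else:
            if run * frame_sec >= min_pause_sec:
                pauses.append(run * frame_sec)
            run = 0
    return {"pause_total": sum(pauses), "pause_max": max(pauses, default=0.0)}
```

Note that leading and trailing silence count as pauses in this version; trimming them first is a reasonable alternative when clips have different amounts of padding.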
Finding

Pitch is the cleanest single-channel classifier in Zonos.

F0 std leads ANOVA with F=9.01 and p=1.62e-05; F0 std and F0 range are also the top two mutual-information features.

Finding

Timbre is the largest overall movement, not a rounding error.

MFCC mean and std shifts rank second and third by ANOVA, while timbre has the highest mean absolute neutral-relative channel shift.

Finding

Pause structure is a real affect channel.

Pause total and pause max both appear in the top ANOVA features for Zonos, and they dominate the Kokoro proxy experiment.

Finding

Emotion controls can become mode changes.

The nonlinearity sweep is included because emotional intensity should be tested for breakpoints, not assumed to be a linear style slider.

Zonos emotion channels ranked by neutral-relative acoustic movement.
Slide evidence · Zonos channel effect

The emotion vector moves several physical channels at once.

This is the key plot. Each bar is a channel-level movement relative to the same sentence in neutral voice. Timbre moves most (1.55), voice quality (1.09) and pitch (1.00) follow, and tempo (0.79) and pause (0.70) changes are still visible.

Timbre = 1.55 mean abs neutral-relative z
Read it as physical movement, not classifier accuracy. A large bar means the audio changed there; it does not automatically mean that channel is the best label decoder.
Zonos emotion vector matrix showing controlled emotion dimensions.
Slide evidence · experiment control

The input is an emotion-vector matrix, not vague prompt wording.

The experiment changes the explicit Zonos emotion vector while holding the rest of the generation setup fixed. That makes this closer to a causal probe than a prompt-vibes comparison.

7 conditions across an 8D emotion vector
Rows are intended conditions. Columns are emotion-vector dimensions. The result matrix starts after generation and feature extraction.
Feature extraction map for pitch, tempo, energy, pauses, timbre, and voice quality.
Slide evidence · feature extraction

The features map emotion into measurable audio physics.

F0, RMS, rate, pauses, spectral centroid, MFCC movement, jitter proxy, and shimmer proxy are the measurement instrument: the bridge between the audio and the research claim.

6 acoustic channels · 20+ derived features
This is why the page can say which channel carries emotion instead of merely saying which clip sounds emotional.
Heatmap of physical acoustic dimensions by emotion condition.
Slide evidence · physical dimensions

Emotion is not just higher pitch.

The heatmap shows which acoustic dimensions rise or fall for each label. Fearful has broad movement across pauses, tempo, energy, timbre, and centroid. Happy is not simply higher pitch.

F0 std ANOVA p = 1.62e-05
Cell values are normalized changes relative to the neutral output of the same sentence, so text difficulty is largely controlled.
MFCC timbre fingerprint changes across Zonos emotion outputs.
Slide evidence · timbre fingerprint

Timbre changes are measurable, not decorative.

MFCC coefficients summarize the spectral envelope, so this view captures voice color more directly than a raw waveform or subjective listening note.

MFCC std shift ANOVA p = 2.97e-05
The MFCC heatmap becomes stable quantitative evidence: it can be averaged, tested, and fed into a classifier.
Zonos emotion intensity sweep showing nonlinear acoustic behavior.
Slide evidence · nonlinearity

Emotion intensity is not a straight slider.

The happy and angry sweeps do not behave like one monotonic knob. Some features jump early, some peak in the middle, and some move mostly near the endpoint.

Fixed seed across alpha values
This supports the nonlinearity hypothesis: emotion-vector changes alter model state, not one acoustic scalar.
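A simple way to test a sweep for breakpoints is to look at the step-to-step changes of one feature across alpha values. The sketch below is an assumption about how such a check could look, with made-up numbers; it is not the lab's analysis code.

```python
def sweep_profile(alphas, values):
    """Summarize how one acoustic feature moves across an intensity sweep.

    Needs at least two sweep points.
    """
    steps = [b - a for a, b in zip(values, values[1:])]
    monotonic = all(s >= 0 for s in steps) or all(s <= 0 for s in steps)
    biggest = max(range(len(steps)), key=lambda i: abs(steps[i]))
    total = abs(values[-1] - values[0])
    return {
        "monotonic": monotonic,
        "largest_jump_between": (alphas[biggest], alphas[biggest + 1]),
        # Largest single step as a share of the end-to-end change.
        # A linear slider with 4 equal steps would give 0.25.
        "jump_share": abs(steps[biggest]) / total if total else None,
    }


# Illustrative F0 std values at five intensities of one emotion.
profile = sweep_profile([0.0, 0.25, 0.5, 0.75, 1.0], [10.0, 10.5, 18.0, 17.0, 17.5])
```

A non-monotonic profile with most of the movement in one step is the pattern the page describes as a mode change rather than a slider.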
Michael Caine If reading emotion contrast heatmap.
Slide evidence · real speech target

Caine's reading is a trajectory, not one emotion label.

The real-speech target shows local rises in fearful, angry, sad, low-energy, happy, and surprised evidence. The task becomes trajectory matching across poem lines.

547 windows · 37 transcript segments
Contrast subtracts neutral/common speech activity, so the plot shows what increases locally, not just loudness.
Artifact diagnosis for clipped Zonos poem chunks.
Slide evidence · artifact audit

Bad generation hygiene can fake an emotion match.

The first fixed-speaker poem had hard stops caused by token caps, and those artifacts changed the score. The audit is included because it shows the research is skeptical of its own outputs.

Clipped chunks around 6 seconds
A better-looking score is not useful if artifacts are driving the acoustic evidence.
Zonos target run multidimensional line-level evidence for Kipling If.
Slide evidence · multidimensional performance

Dominant emotion is only the readable summary.

A poem line can have fearful, angry, low-energy, and pitch/energy evidence at the same time. The multidimensional line view is the real evidence layer.

Line-level vectors, not one-hot labels
This is why matching a human actor requires a trajectory of physical dimensions, not one selected emotion per line.
§ 02B · Gradium baseline

Gradium is the production counterweight to the open-weight probes.

Zonos is the clean causal experiment because it exposes the emotion vector. Gradium belongs on this page for a different reason: it is the vendor-grade system we measured for naturalness, information preservation, latency, and long-form dramatic reading.

The best Gradium run uses the Kent voice with a slow-loose decoding setup. It scored 0.368 against the Caine target in the local acoustic-emotion space, while the separate intelligibility suite ranks Gradium first on hard prompts.

4.412 · UTMOS on 50 Harvard sentences · Clean-sentence naturalness, measured by CodeSOTA.
13.4% · normalized WER · Best run on the 30-prompt hard intelligibility suite.
299 ms · p95 first-byte latency · Measured Gradium Audrey run, useful for voice agents.
0.368 · Caine sweep score · Kent slow-loose best match against the dramatic-reading target.
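A WER round trip is usually scored by transcribing the synthesized audio with an ASR system and computing word-level edit distance against the input text. The sketch below covers only the scoring step. The normalization rule (lowercase, punctuation stripped) is an assumption, since the page does not define "normalized", and the hypothesis string stands in for an ASR transcript.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation so formatting is not counted as an error."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length.

    Raises ZeroDivisionError for an empty reference.
    """
    ref, hyp = normalize(reference), normalize(hypothesis)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One dropped word out of eight reference words.
score = wer("The birch canoe slid on the smooth planks.",
            "the birch canoe slid on smooth planks")
```

Because the ASR system contributes its own errors, the same recognizer and the same normalization have to be used for every TTS model being compared.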
Best line-level Gradium match against the target Caine emotion timeline.
Best Gradium line match
Line-level matching makes the evaluation inspectable instead of hiding the result behind a single score.
Axis | Value | Read
Voice | Kent | Best Gradium voice in the sweep.
Config | cfg 1.2 · padding 0.6 · temp 0.85 | The winning decoding/control setup.
Line pacing | 0.3s line · 1.0s stanza | Keeps the poem readable without hiding timing drift.
Rank agreement | 0.480 | How well line-level emotion ordering follows Caine's target.
Distribution similarity | 0.406 | Whether the whole poem occupies a similar emotion mix.
Target evidence | 0.496 | How much target-emotion evidence appears in generated lines.
Gradium voice and configuration sweep leaderboard against the Michael Caine If reading.
Gradium sweep leaderboard
The leaderboard makes Gradium inspectable as a production baseline: voice choice, pacing, and decoding settings are compared against the same Caine target.
Gradium full If poem transcript colored by predicted emotion line by line.
Line-level transcript
The full transcript view shows the generated performance as a sequence of local emotion decisions, not a single global label.
Gradium synthetic clone full If poem transcript colored by predicted emotion.
Clone transcript stress test
The clone run is included as a stress test for whether voice identity and long-form emotional structure survive together.
Gradium performance sample

The Kent slow-loose run was the best sweep candidate: 0.368 weighted score, 0.48 rank agreement, and 0.496 target-emotion evidence. The score comes with a caveat: it measures acoustic similarity to a local Zonos emotion space and is not a human emotion recognizer.
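The page does not say how rank agreement is computed. One plausible reading is a Spearman rank correlation between per-line emotion evidence in the target reading and in the generated reading, sketched here under that assumption with made-up scores.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(a, b):
    """Pearson correlation of the ranks of two equal-length sequences."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical per-line "fearful" evidence for four poem lines.
target = [0.10, 0.40, 0.35, 0.80]
generated = [0.20, 0.30, 0.50, 0.60]
agreement = spearman(target, generated)
```

A rank-based score rewards getting the ordering of lines right even when the generated voice expresses every emotion more weakly than the actor does.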

Control surface

Which TTS models are experiment-ready?

Model | Controls | Role
Zonos | 8D emotion vector, rate, pitch std, speaker embedding | Primary open-weight causal test
IndexTTS-2 | 8-float emotion vector, text-emotion mode, speaker prompt | Best cross-check candidate
Chatterbox | exaggeration, CFG, temperature, reference audio | Nonlinearity and stability stress test
OmniVoice | pitch, whisper, speed, duration, tags | Acoustic-channel isolation baseline
FastPitch / FastSpeech 2 | explicit pitch, duration, energy | White-box mechanism baseline
Gradium | voice, CFG coefficient, padding bonus, temperature, pacing | Production performance and long-form reading sweep
What this adds to the leaderboard

MOS answers whether a voice sounds natural. This research asks whether a voice can be directed.

For voice agents, audiobooks, tutoring, games, and dubbing, the decisive question is whether the model obeys performance controls without destroying intelligibility, identity, or timing.

Limitations
  • Zonos labels are model-control labels, not direct human affect labels.
  • Kokoro labels are proxy recipes, so high accuracy can mean the classifier learned pacing.
  • Jitter and shimmer are frame-level proxies, not Praat-grade cycle measurements.
  • The Caine comparison scores acoustic similarity to a local emotion space, not acting quality.
Next benchmark

The next step is a public controllability suite.

01
Semantics vs direction
Positive, negative, and neutral sentences crossed with contradictory voice directions.
02
Emotion vs speaker identity
Measure whether strong emotion reduces speaker embedding similarity to a fixed reference.
03
Temporal emotion arcs
Ask a model to move from calm to fear to anger inside one utterance and score window-level smoothness.
04
Prosody hallucinations
Lower stability or raise style strength until tempo, F0, WER, or alignment breaks.
05
Vendor-neutral markup
Draft an Emotional Direction Markup layer and map it to Zonos, Gradium, ElevenLabs, and OmniVoice-style controls.
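Item 02 of the suite could be scored with cosine similarity between speaker embeddings. The sketch below assumes the embeddings already exist as plain lists of floats; the model that produces them is not specified on this page, and the vectors shown are toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def identity_drift(reference, by_condition):
    """Similarity lost relative to the neutral rendering, per emotion condition."""
    base = cosine(reference, by_condition["neutral"])
    return {
        name: base - cosine(reference, emb)
        for name, emb in by_condition.items()
        if name != "neutral"
    }


# Toy 2-D embeddings; real speaker embeddings have hundreds of dimensions.
drift = identity_drift(
    reference=[1.0, 0.0],
    by_condition={"neutral": [1.0, 0.0], "angry": [0.0, 1.0], "sad": [2.0, 0.0]},
)
```

Measuring drift against the neutral rendering, not against a perfect score of 1.0, separates the cost of emotion from the baseline cloning error of the engine.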
§ 03 · TTS registry

Registry rows, tiered.

Reported MOS stays visible, but it is not ranked as equivalent to CodeSOTA-measured runs. Sub-0.1 MOS gaps are marked as noise.


For rankable in-repo numbers see the measured leaderboard.

Model | Vendor | Kind | Verification | Architecture | Params | MOS | MOS note | Year
ElevenLabs Turbo v2.5 | ElevenLabs | Cloud API | vendor reported | Proprietary (diffusion-based) | - | 4.8 | within MOS noise | 2024
Sesame CSM | Sesame | Open Source | community reported | Conversational Speech Model | 1B+ | 4.7 | within MOS noise | 2025
OpenAI TTS HD | OpenAI | Cloud API | vendor reported | Proprietary | - | 4.7 | within MOS noise | 2023
Gemini 2.5 Pro TTS | Google | Cloud API | vendor reported | Multimodal LLM (native audio) | - | 4.7 | within MOS noise | 2025
Cartesia Sonic 2 | Cartesia | Cloud API | vendor reported | State-space model | - | 4.7 | within MOS noise | 2025
ElevenLabs Flash v2.5 | ElevenLabs | Cloud API | vendor reported | Proprietary (optimized) | - | 4.6 | reported MOS; no CodeSOTA CI | 2025
PlayHT 3.0 | PlayHT | Cloud API | vendor reported | Proprietary | - | 4.6 | reported MOS; no CodeSOTA CI | 2025
Fish Audio S2 Pro | Fish Audio | Open Source | paper reported | Dual-autoregressive transformer + RVQ audio codec | 5B | 4.6 | reported MOS; no CodeSOTA CI | 2026
Orpheus TTS | Canopy Labs | Open Source | community reported | LLM-based (Llama backbone) | 3B | 4.6 | reported MOS; no CodeSOTA CI | 2025
Gemini 2.5 Flash TTS | Google | Cloud API | vendor reported | Multimodal LLM (native audio) | - | 4.5 | reported MOS; no CodeSOTA CI | 2025
Kokoro v1.0 | Hexgrad | Open Source | codesota measured | Lightweight autoregressive | 82M | 4.5 | no CI yet; measured run exposes sample count and artifacts | 2025
XTTS v2 | Coqui | Open Source | paper reported | GPT-like + VITS decoder | 467M | 4.5 | reported MOS; no CodeSOTA CI | 2024
Google Chirp 3 HD | Google | Cloud API | vendor reported | Generative (USM-based) | - | 4.4 | reported MOS; no CodeSOTA CI | 2025
Gradium TTS | Gradium | Cloud API | codesota measured | Proprietary neural TTS | - | 4.4 | no CI yet; measured run exposes sample count and artifacts | 2026
Fish Speech 1.5 | Fish Audio | Open Source | community reported | VQGAN + Transformer | 500M | 4.4 | reported MOS; no CodeSOTA CI | 2025
F5-TTS | Shanghai AI Lab | Open Source | paper reported | Flow-matching (non-autoregressive) | 335M | 4.4 | reported MOS; no CodeSOTA CI | 2024
Dia 1.6B | Nari Labs | Open Source | community reported | Transformer + non-verbal tokens | 1.6B | 4.3 | reported MOS; no CodeSOTA CI | 2025
Spark-TTS | SparkAudio | Open Source | community reported | Controllable Transformer | 500M | 4.3 | reported MOS; no CodeSOTA CI | 2025
Supertonic 3 | Supertone | Open Source | community reported | ONNX Runtime local inference | 99M | 4.2 | reported MOS; no CodeSOTA CI | 2026
Parler-TTS | Hugging Face | Open Source | paper reported | Prompt-controlled Transformer | 880M | 4.1 | reported MOS; no CodeSOTA CI | 2025
Piper | Rhasspy | Open Source | community reported | VITS (lightweight) | ~20M | 3.6 | reported MOS; no CodeSOTA CI | 2023
Fig 1 · MOS on a 1-5 scale. Listener panels and reference audio differ across sources; gaps under 0.1 are noise.
§ 04 · Picks

By use-case.

MOS rankings are one axis. The model you actually want depends on what you're building: voice agents, audiobooks, and on-device assistants each pull in different directions.

Real-time voice agents

Cartesia Sonic 2

Sub-200ms TTFB with streaming

<90ms TTFB, state-space architecture purpose-built for interactive voice. ElevenLabs Flash v2.5 is the fallback at ~120ms.

Maximum naturalness

ElevenLabs Turbo v2.5

Highest literature MOS

4.8 MOS (vendor reported), the highest figure in the registry, though within noise of the 4.7 cluster. Large voice library and commercial cloning.

Open-source on-device

Kokoro v1.0

Apache / MIT licence, <1B params, CPU-friendly

82M params, Apache-2.0, ~10x real-time on CPU. Tied with commercial APIs on CodeSOTA's independent UTMOS eval (4.48).

Expressive / dialogue

Sesame CSM

Handles emotion, laughter, pauses, multi-speaker

Conversational Speech Model — 4.7 MOS with emotional expressiveness. Dia 1.6B is the alternative for scripted dialogue with non-verbal cues.

Voice cloning (zero-shot)

F5-TTS / XTTS v2

Clone from seconds of reference audio

F5-TTS uses flow-matching for fast cloning; XTTS v2 covers 17 languages. Orpheus TTS is the LLM-based alternative with emotion tags.

Multilingual breadth

Google Chirp 3 HD

30+ languages supported

31 languages, 8 voice personas, instant cloning. Fish Speech 1.5 is the open-source alternative for CJK-heavy deployments.

§ 05 · Open vs cloud

The gap nearly closed.

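As of 2026 the quality gap is nearly closed on reported MOS: the best open-source TTS (Sesame CSM, 4.7 MOS) is 0.1 behind the top commercial API (ElevenLabs Turbo v2.5, 4.8 MOS), and the next open models in the registry (Fish Audio S2 Pro and Orpheus TTS at 4.6, Kokoro v1.0 and XTTS v2 at 4.5) sit within 0.3. The remaining cloud advantages are in infrastructure, not naturalness.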

When to go open source
  • Data residency or air-gapped deployment
  • High volume where per-character pricing hurts
  • Custom fine-tuning on your voice or domain
  • Edge / on-device inference (Kokoro, Piper)
  • Full reproducibility for research
When to go cloud API
  • Sub-200ms TTFB streaming (Cartesia, ElevenLabs Flash)
  • Professional voice cloning with licensing support
  • Broad multilingual coverage (Chirp 3, ElevenLabs)
  • Managed infrastructure, SLA, autoscaling
  • You don't want to host a GPU
§ 06 · How TTS is scored

MOS, and what it misses.

Human raters listen to generated audio and score naturalness from 1 (bad) to 5 (indistinguishable from human). A modern TTS target is 4.5+. Reference recordings of real speech typically score 4.5–4.7, which is the practical ceiling within a single listening study.

Because real MOS studies are slow and expensive, papers often use automatic MOS predictors like UTMOS, trained on crowdsourced ratings. UTMOS correlates around 0.9 with true MOS on TTS-like audio — good enough for ranking, noisy enough that small gaps (<0.1) should not be treated as decisive.
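The registry notes flag rows with no confidence interval. A minimal sketch of what such an interval could look like is given below, using a normal approximation over per-clip ratings. Both the approximation and the non-overlap rule are assumptions chosen for simplicity, and the ratings are invented.

```python
from statistics import NormalDist, mean, stdev


def mos_with_ci(ratings, confidence=0.95):
    """Mean rating with a normal-approximation confidence interval.

    Needs at least two ratings; small panels would be better served by a
    t-distribution, which the standard library does not provide.
    """
    m = mean(ratings)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half = z * stdev(ratings) / len(ratings) ** 0.5
    return m, m - half, m + half


def gap_is_decisive(a, b):
    """Call two models separated only when their intervals do not overlap."""
    (_, lo_a, hi_a), (_, lo_b, hi_b) = mos_with_ci(a), mos_with_ci(b)
    return lo_a > hi_b or lo_b > hi_a


# Invented ratings for two systems rated on the same clips.
system_a = [4, 5, 4, 5, 4, 5, 4, 5]
system_b = [5, 4, 5, 4, 5, 5, 4, 4]
m, lo, hi = mos_with_ci(system_a)
separated = gap_is_decisive(system_a, system_b)
```

Requiring non-overlapping intervals is a conservative rule, which suits a leaderboard where listener panels and reference audio differ between sources.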

MOS is one axis. The model you actually pick is a function of latency, licence, voice library, and intelligibility on hard text. The best MOS can lose to a worse-MOS model that ships a better voice library and sub-200ms TTFB.

Related

Neighbouring registers.

Independent TTS eval
First-party UTMOS + WER measurement.
Speech hub · STT + TTS
Combined register, 35+ models.
Speech-to-text
The paired STT leaderboard.
Guide · TTS models
Long-form walkthrough of the landscape.
Beta · MOS column is literature-sourced for most rows. See the independent eval for numbers measured in-repo. Feedback to k.wikiel@gmail.com.