Voice cloning | Updated April 2026

Best TTS for voice cloning

Voice cloning has moved from novelty to infrastructure. Ten seconds of reference audio is enough for most models to produce convincing output. This page compares vendors on fidelity, sample requirements, and consent tooling — because the ethical defaults matter as much as the speaker-similarity score.

TL;DR

  • Best fidelity: ElevenLabs Professional Voice Clone. Needs 30+ min of studio audio; near-indistinguishable from the source.
  • Best instant clone: PlayHT 3.0 or ElevenLabs IVC, from 60 seconds of input.
  • Best self-host: F5-TTS (MIT) or XTTS-v2 (CPML). Zero-shot, 10s reference.
  • Best streaming clone: Cartesia, the only sub-100ms TTFB vendor with cloning.

How voice cloning actually works

Modern zero-shot cloning is not fine-tuning. A pretrained speaker encoder reads the reference audio and emits a fixed-length embedding that conditions the acoustic model at inference. The acoustic model was already trained on thousands of speakers, so it knows how voices differ; the embedding just tells it which voice to render. Fine-tuned tiers such as ElevenLabs Professional are the exception and need far more audio.

Figure: How voice cloning works. Reference audio is encoded into a speaker embedding that conditions the acoustic model at inference.

  • Reference audio: 10-30s of the target voice.
  • Speaker encoder (ResNet / WavLM / ECAPA, trained on 10k+ speakers): audio → 192-d vector.
  • Speaker embedding: [0.24, -0.81, 0.15, ..., 0.62].
  • Text input: “Speak as me.” Any text, any language.
  • Acoustic model (speaker-conditioned): text + embedding → mel.
  • Vocoder (HiFi-GAN / BigVGAN): mel → audio.
  • Cloned voice: new text, same voice.
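
To make the "fixed-length embedding" idea concrete, here is a toy sketch rather than a real speaker encoder. It assumes frame-level features are already available and uses mean pooling followed by L2 normalisation, which is one common way to collapse a variable number of frames into a single vector. The function name `pool_embedding` and the 4-dimensional features are illustrative assumptions; real encoders use learned layers and, per the diagram above, emit 192 dimensions.

```python
import math
from typing import List


def pool_embedding(frames: List[List[float]]) -> List[float]:
    """Collapse variable-length frame features into one unit-length vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    # Mean over time: the output size depends on dim, not on clip length.
    mean = [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        return mean
    return [x / norm for x in mean]


# Two clips of different length map to vectors of the same size.
short_clip = [[0.2, -0.8, 0.1, 0.6]] * 100  # hypothetical 100 frames
long_clip = [[0.2, -0.8, 0.1, 0.6]] * 300   # hypothetical 300 frames
emb_short = pool_embedding(short_clip)
emb_long = pool_embedding(long_clip)
```

Because the pooled vector has the same size for a 10s or a 30s clip, the acoustic model can take it as a conditioning input of fixed shape.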

Vendor capability radar

Figure: Capability radar. Cloning vendors across six axes (clone fidelity, sample needed, languages, streaming, cost, consent tooling), each scored 0-10; higher is better. The overlay shows trade-offs between ElevenLabs Pro, PlayHT 3.0, Cartesia, and F5-TTS (OSS).

Side-by-side

| Vendor | Mode | Min sample | Fidelity | Streaming | License / hosting |
|---|---|---|---|---|---|
| ElevenLabs Professional | Fine-tuned | 30+ min | 9.5/10 | Yes (Flash) | Hosted, $99+/mo |
| ElevenLabs Instant (IVC) | Zero-shot | 60s | 8.5/10 | Yes | Hosted, $22+/mo |
| PlayHT 3.0 | Zero-shot or fine-tune | 30-60s | 9/10 | Yes | Hosted, $39+/mo |
| Cartesia | Zero-shot | 15s | 8/10 | Yes (<100ms) | Hosted, usage-based |
| Google Chirp 3 HD | Zero-shot (Custom Voice) | 10s | 7.5/10 | Yes | GCP, usage-based |
| F5-TTS | Zero-shot | 10s | 8/10 | Limited | MIT, self-host |
| XTTS-v2 (Coqui) | Zero-shot | 6s | 7.5/10 | No | CPML (research only), self-host |
| Fish Speech (OpenAudio-S1) | Zero-shot | 10s | 8/10 | Yes | CC-BY-NC, self-host |
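
The table can also be read as a filter problem: given how much reference audio you have and whether you need streaming or self-hosting, which rows survive? The sketch below transcribes the table into a small data structure and sorts the matches by the page's fidelity score. The `Vendor` class and `shortlist` function are hypothetical names for illustration; treating "Limited" streaming as no streaming and using the lower bound of a sample range are simplifying assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Vendor:
    name: str
    min_sample_s: int  # minimum reference audio, in seconds
    fidelity: float    # the page's 0-10 score
    streaming: bool
    self_host: bool


# Values transcribed from the side-by-side table. "Limited" streaming is
# treated as False and ranges use their lower bound (both are assumptions).
VENDORS = [
    Vendor("ElevenLabs Professional", 30 * 60, 9.5, True, False),
    Vendor("ElevenLabs Instant (IVC)", 60, 8.5, True, False),
    Vendor("PlayHT 3.0", 30, 9.0, True, False),
    Vendor("Cartesia", 15, 8.0, True, False),
    Vendor("Google Chirp 3 HD", 10, 7.5, True, False),
    Vendor("F5-TTS", 10, 8.0, False, True),
    Vendor("XTTS-v2 (Coqui)", 6, 7.5, False, True),
    Vendor("Fish Speech (OpenAudio-S1)", 10, 8.0, True, True),
]


def shortlist(available_s: int, need_streaming: bool, need_self_host: bool) -> List[Vendor]:
    """Return vendors that fit the constraints, best fidelity first."""
    fits = [
        v for v in VENDORS
        if v.min_sample_s <= available_s
        and (v.streaming or not need_streaming)
        and (v.self_host or not need_self_host)
    ]
    return sorted(fits, key=lambda v: v.fidelity, reverse=True)


picks = shortlist(available_s=20, need_streaming=True, need_self_host=False)
print([v.name for v in picks])
```

The filter deliberately ignores licensing. CPML and CC-BY-NC terms still need a separate check before any commercial deployment.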

Reference vs cloned fingerprint

A well-cloned voice matches the reference's formant positions and harmonic density. Cheap clones get the timbre wrong but fake the pitch — audible as an uncanny-valley effect.

Figure: mel spectrograms (0-8 kHz, 0.0-2.0s) of the original speaker's 10s reference sample and of the F5-TTS clone generated from it. Both panels render the same sentence: “This voice is a clone trained on a short reference sample. Can you tell it apart?”
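
One common way to put a number on how close a clone sits to its reference is cosine similarity between speaker embeddings of the two clips. The sketch below assumes the embeddings were already produced by some speaker encoder; the vectors and the function name `cosine_similarity` are illustrative, and this is not necessarily how the fidelity scores on this page were derived.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embeddings: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical 4-d embeddings; real ones are much longer.
reference = [0.24, -0.81, 0.15, 0.62]
good_clone = [0.22, -0.79, 0.18, 0.60]
poor_clone = [0.70, -0.10, -0.40, 0.05]

good_score = cosine_similarity(reference, good_clone)
poor_score = cosine_similarity(reference, poor_clone)
```

A score near 1.0 suggests the encoder hears the two clips as the same speaker. It says nothing about prosody or naturalness, so it complements listening tests rather than replacing them.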

Listen: one reference, four clones

All four samples read the same sentence: “This voice is a clone trained on a short reference sample. Can you tell it apart?”

  • ElevenLabs IVC (consented clone), model eleven_multilingual_v2: sample TBD.
  • PlayHT 3.0 (consented clone), model Play 3.0 Mini: sample TBD.
  • Cartesia (consented clone), model sonic-2: sample TBD.
  • F5-TTS (self-hosted clone), model F5-TTS: sample TBD.

Minimal cloning code

Hosted: ElevenLabs IVC

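# ElevenLabs Instant Voice Clone: 60s of consented audio.
# pip install elevenlabs
import os
from elevenlabs.client import ElevenLabs

# Read the key from the environment instead of hard-coding it.
client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

# Upload the reference recording; the context manager closes the file afterwards.
with open("alice_reading_script.wav", "rb") as ref:
    voice = client.voices.ivc.create(
        name="Alice (consented)",
        files=[ref],
        description="Alice gave written consent on 2026-02-14. See consent-ledger.md.",
    )

# convert() yields the synthesized audio as chunks of bytes.
audio = client.text_to_speech.convert(
    voice_id=voice.voice_id,
    model_id="eleven_multilingual_v2",
    text="This voice is a clone trained on Alice's consented sample.",
)

# Stream the chunks to disk.
with open("alice_clone.mp3", "wb") as out:
    for chunk in audio:
        out.write(chunk)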

Self-host: F5-TTS

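# F5-TTS: MIT-licensed, flow-matching. Zero-shot cloning from 10s reference.
# pip install f5-tts
from f5_tts.api import F5TTS

# Constructor keywords have changed between f5-tts releases; check
# f5_tts/api.py in your installed version if this call is rejected.
model = F5TTS(model_type="F5-TTS", ckpt_file="F5-TTS/ckpts/model.pt")

# infer() returns the waveform, its sample rate, and the mel spectrogram.
wav, sr, spec = model.infer(
    ref_file="alice_reference.wav",
    ref_text="This is a reference sample from Alice.",
    gen_text="And this is new text in Alice's voice.",
    file_wave="alice_clone.wav",  # also write the result to disk
)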
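
Since zero-shot models are sensitive to how much reference audio they receive, it can help to check the clip length before cloning. The sketch below uses only the standard-library `wave` module, so it assumes an uncompressed PCM WAV file. The helper names and the 10-30s default window (taken from the diagram earlier on this page) are assumptions, not requirements published by F5-TTS.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference(path: str, min_s: float = 10.0, max_s: float = 30.0) -> float:
    """Return the clip duration, or raise if it falls outside the window."""
    duration = wav_duration_seconds(path)
    if not (min_s <= duration <= max_s):
        raise ValueError(
            f"{path}: {duration:.1f}s is outside the {min_s:.0f}-{max_s:.0f}s window"
        )
    return duration
```

Calling `check_reference("alice_reference.wav")` before `model.infer(...)` turns a silently degraded clone into an explicit error.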

Consent, ethics, and law

Voice cloning without consent is at minimum a civil wrong in most jurisdictions and a crime in some. The 2024 FCC ruling declaring AI-cloned voice robocalls illegal under the TCPA was a warning shot. The EU AI Act classifies voice cloning as limited-risk with mandatory disclosure.

Production minimums

  • Written, timestamped consent tied to the specific reference audio.
  • Per-clone audit log of prompts generated.
  • Watermark output audio (Resemble AI, ElevenLabs AI Speech Classifier, or SileroWM).
  • Rate limiting and prompt-moderation to block impersonation of public figures.
  • “AI-generated voice” disclosure on output, per EU AI Act.
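
The first two items on this list can be sketched with the standard library alone. The example binds a consent record to one specific recording by hashing the audio bytes, and emits one JSON line per generation request for an append-only audit log. All field names, the function names, and the `voice_123` identifier are hypothetical; a production system would also need tamper-evident storage and access control.

```python
import hashlib
import json
from datetime import datetime, timezone


def consent_record(speaker: str, audio_bytes: bytes, consent_doc: str) -> dict:
    """Bind a consent statement to one specific reference recording."""
    return {
        "speaker": speaker,
        # The hash changes if even one byte of the recording changes.
        "audio_sha256": hashlib.sha256(audio_bytes).hexdigest(),
        "consent_doc": consent_doc,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def audit_line(voice_id: str, prompt: str) -> str:
    """One JSON line per generation request, for an append-only log."""
    entry = {
        "voice_id": voice_id,
        "prompt": prompt,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(entry, sort_keys=True)


record = consent_record("Alice", b"fake-wav-bytes", "consent-ledger.md")
line = audit_line("voice_123", "Hello from Alice's clone.")
```

Hashing the audio means the consent covers exactly that recording: a re-recorded or edited reference produces a different hash and needs its own record.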

Vendor consent features

  • ElevenLabs: mandatory voice verification via spoken phrase for Pro clones.
  • Google Chirp 3 HD Custom Voice: requires consent statement embedded in reference audio.
  • PlayHT: identity verification + consent form on clone creation.
  • Open-source: no gates. Build your own — or you personally own the liability.
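
For self-hosted models, "build your own" gate could start as small as the sketch below: a per-user sliding-window rate limit combined with a naive blocklist check on the prompt. The `CloneGate` class and the placeholder blocklist entry are hypothetical, and substring matching is only a stand-in for real moderation, which would also need to inspect the reference audio and the claimed speaker identity.

```python
import time
from collections import deque
from typing import Deque, Dict, Optional

# Hypothetical placeholder; a real deployment would maintain this list properly.
BLOCKED_NAMES = {"example public figure"}


class CloneGate:
    """Sliding-window rate limit plus a naive blocklist check."""

    def __init__(self, max_requests: int, window_s: float) -> None:
        self.max_requests = max_requests
        self.window_s = window_s
        self._hits: Dict[str, Deque[float]] = {}

    def allow(self, user_id: str, prompt: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        lowered = prompt.lower()
        if any(name in lowered for name in BLOCKED_NAMES):
            return False
        hits = self._hits.setdefault(user_id, deque())
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


gate = CloneGate(max_requests=2, window_s=60.0)
```

The optional `now` argument exists so the limiter can be tested deterministically; in normal use it falls back to `time.monotonic()`.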
