Updated April 2026

Best TTS for real-time voice agents

For a voice agent to feel human, you have roughly 800ms end-to-end between the user stopping speaking and the agent starting to respond. TTS consumes 75-380ms of that budget, depending on the model. This page ranks streaming TTS models by time-to-first-byte (TTFB) and breaks down the full round-trip budget.

TL;DR

  • Winner (cloud): Cartesia Sonic 2 at ~90ms TTFB, 4.7 MOS. Best balance of quality and latency.
  • Winner (quality-at-speed): ElevenLabs Flash v2.5 at ~75ms TTFB, 4.55 MOS.
  • Winner (self-host): Piper running on CPU at ~35ms TTFB, acceptable quality, zero API cost.
  • Avoid for real-time: OpenAI tts-1-hd (>500ms), Google Studio voices (>500ms), ElevenLabs Turbo (~275ms).

TTFB leaderboard

Time-to-first-byte on each vendor's lowest-latency streaming model. US-East origin, 40-char prompt, measured over 50 calls in April 2026. Piper runs locally on an M2 laptop CPU.
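The measurement itself is easy to reproduce. Below is a minimal, vendor-neutral sketch: `stream_fn` is a hypothetical stand-in for whatever call opens your vendor's streaming synthesis and returns an iterator of audio chunks, and TTFB is the time until the first chunk arrives.

```python
import time
import statistics

def measure_ttfb(stream_fn, runs=50):
    """Call `stream_fn` (which must return an iterator of audio chunks)
    `runs` times and return the median time-to-first-byte in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        chunks = stream_fn()
        next(iter(chunks))  # block until the first audio chunk arrives
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)
```

Median rather than mean keeps one slow cold-start call from skewing the number; report p95 too if you care about tail latency.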

Latency waterfall

Streaming TTFB ranking

The dashed pink line at 200ms is the soft ceiling for pleasant-feeling voice agents.

  • Piper (local, M2 CPU): 35ms
  • ElevenLabs Flash v2.5: 75ms
  • Cartesia Sonic 2: 90ms
  • Deepgram Aura-2: 120ms
  • Azure Neural HD: 200ms
  • Google Chirp 3 HD: 240ms
  • ElevenLabs Turbo v2.5: 275ms
  • OpenAI gpt-4o-mini-tts: 380ms

The end-to-end round trip

TTS is one of five hops between user voice in and synthesized voice out. Optimizing TTS alone without a fast STT and a streaming LLM is pointless.

Streaming pipeline

Voice-bot round-trip latency

Budget before perceived awkwardness: ~800ms. TTS is one of five hops — optimize the whole pipeline.

  • Mic capture + VAD endpointing: 40ms
  • STT (Deepgram Nova-3 / Whisper): 180ms
  • LLM first token (GPT-4o / Claude): 250ms
  • TTS TTFB (Cartesia Sonic 2 / Flash v2.5): 90ms
  • Playback (jitter buffer + output): 40ms
  • Total end-to-end: 600ms, under the 800ms budget

Audio streams back in chunks of 20-40ms.
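The waterfall is just addition, so the budget is easy to sanity-check. A quick sketch using the per-hop numbers from the figure above:

```python
# Per-hop latencies in ms, taken from the round-trip waterfall above.
HOPS = {
    "mic capture + VAD endpointing": 40,
    "STT first transcript": 180,
    "LLM first token": 250,
    "TTS TTFB": 90,
    "playback (jitter buffer + output)": 40,
}
BUDGET_MS = 800

total = sum(HOPS.values())
print(f"total: {total}ms, headroom: {BUDGET_MS - total}ms")
# total: 600ms, headroom: 200ms
```

Swapping Sonic 2's 90ms TTFB for OpenAI's 380ms pushes the total to 890ms, which blows the budget on the TTS hop alone.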

Quality doesn't have to drop

Fast doesn't mean bad. The upper-left of this plot (fast + natural) is populated: Sonic 2 and Flash v2.5 both clear 4.5 MOS at sub-100ms. Piper is the only budget option; the rest cluster in the 4.2-4.7 range.

Pareto frontier

MOS vs cost, streaming models only

Same data, filtered to streaming-capable options.

[Scatter plot: cost per 1M characters (USD, log scale, $1-$300) on the x-axis vs MOS (1-5) on the y-axis, with the Pareto frontier marked. Points: ElevenLabs Flash v2.5, Cartesia Sonic 2, Deepgram Aura-2, Azure Neural HD, Google Chirp 3 HD, OpenAI gpt-4o-mini-tts, Piper local CPU.]
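The frontier drawn in the plot is the set of non-dominated points: no other model is both cheaper and higher-MOS. A minimal sketch of that computation; the example points use hypothetical letter names and made-up values purely to illustrate the algorithm, not measured vendor data:

```python
def pareto_frontier(models):
    """Keep models not dominated by any other, where domination means
    another point has cost <= this one AND MOS >= this one."""
    frontier = []
    for name, cost, mos in models:
        dominated = any(
            oc <= cost and om >= mos and (oc, om) != (cost, mos)
            for _, oc, om in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical (cost $/1M chars, MOS) points, for illustration only.
points = [("A", 5, 4.2), ("B", 30, 4.7), ("C", 40, 4.4), ("D", 2, 3.8)]
print(pareto_frontier(points))  # ['A', 'B', 'D'] -- C is beaten by B on both axes
```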

Voice bot reference stack (2026)

Recommended cloud stack

  • STT: Deepgram Nova-3 streaming (2.2% WER, <200ms)
  • LLM: GPT-4o or Claude 3.5 Haiku with streaming tokens
  • TTS: Cartesia Sonic 2 over WebSocket
  • Transport: WebRTC with Opus; LiveKit or Pipecat orchestration
  • Budget: ~640ms end-to-end, leaving 160ms headroom

Recommended self-host stack

  • STT: Parakeet RNNT 1.1B (1.8% WER, GPU streaming)
  • LLM: Llama 3.3 70B on vLLM with continuous batching
  • TTS: Kokoro-82M (GPU) or Piper (CPU)
  • Transport: direct WebSocket; in-process audio pipeline
  • Budget: ~500ms if colocated, ~900ms otherwise

Streaming setup (Cartesia)

# Cartesia Sonic 2 over WebSocket — the lowest-latency setup in production today.
from cartesia import Cartesia

client = Cartesia(api_key="sk_...")
ws = client.tts.websocket()

# Incremental text -> incremental PCM. Send tokens as your LLM produces them.
# `speaker` is a placeholder for your audio sink (e.g. a PyAudio or
# sounddevice output stream); it is not part of the Cartesia SDK.
def stream_tokens(llm_stream):
    for token in llm_stream:
        ws.send(
            model_id="sonic-2",
            transcript=token,
            voice={"mode": "id", "id": "<voice_id>"},
            output_format={"container": "raw", "encoding": "pcm_s16le", "sample_rate": 24000},
            continue_=True,
        )
    ws.send(transcript="", continue_=False)  # empty final frame flushes the stream

    # Reading after sending keeps the example short; in production, consume
    # ws.receive() on a separate thread or task so playback starts immediately.
    for chunk in ws.receive():
        speaker.write(chunk.audio)

Note the continue_=True flag. You want to send LLM tokens as they arrive rather than waiting for a full sentence — this collapses the LLM→TTS serial delay into a single pipeline.
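The token-forwarding pattern can be exercised without an API key by swapping in a stub that records sends. The stub below is a test double that only mimics the keyword arguments the snippet above uses; it is not the Cartesia SDK and produces no audio:

```python
class StubTTSSocket:
    """Records frames the way the websocket snippet above sends them."""
    def __init__(self):
        self.messages = []

    def send(self, transcript, continue_=True, **kwargs):
        self.messages.append((transcript, continue_))

def stream_tokens(ws, llm_stream):
    # Forward each LLM token immediately; flush with an empty final frame.
    for token in llm_stream:
        ws.send(transcript=token, continue_=True)
    ws.send(transcript="", continue_=False)

ws = StubTTSSocket()
stream_tokens(ws, iter(["Hi, ", "this ", "is ", "Ada."]))
print(ws.messages[-1])  # ('', False)
```

Four token frames plus the flush frame: exactly the wire pattern that lets synthesis begin before the LLM has finished its sentence.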

Listen: same prompt, every vendor

Every vendor reads the same prompt: "Hi, this is Ada from support. I can see your last order was flagged — let me fix that right now."

  • Cartesia Newsreader (sonic-2): sample TBD
  • ElevenLabs Rachel (eleven_flash_v2_5): sample TBD
  • Deepgram Asteria (aura-2): sample TBD
  • Azure en-US-JennyNeural (Neural HD): sample TBD
