Audio-Text-to-Text

Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding coverage, but real-world spoken dialogue understanding remains well beyond what leaderboards capture.
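
For context, the lossy cascade that native audio models replace is easy to sketch. Below is a minimal, illustrative version using the Hugging Face transformers pipeline API; the model IDs and the "sample.wav" path are assumptions for illustration, not a recommendation.

```python
# A minimal sketch of the ASR-then-LLM cascade, assuming Hugging Face
# transformers; model IDs and "sample.wav" are illustrative placeholders.
from transformers import pipeline

# Stage 1: speech -> text. Prosody, overlap, and hesitation are lost here.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
transcript = asr("sample.wav")["text"]

# Stage 2: text -> text. The LLM only ever sees the flattened transcript.
llm = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
messages = [{"role": "user", "content": f'The user said: "{transcript}". Respond helpfully.'}]
print(llm(messages, max_new_tokens=128)[0]["generated_text"][-1]["content"])
```

Everything downstream of stage 1 is blind to how the words were spoken, which is exactly the information the natively multimodal models above retain.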

Datasets: 2 · Results: 18 · Canonical metric: accuracy

Canonical Benchmark

VoiceBench

Comprehensive evaluation benchmark for voice agents (LLM-based speech assistants) measuring instruction following, robustness to accents/noise/content variations, and task performance across diverse scenarios.

Primary metric: accuracy
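
As a sketch of what an accuracy-style evaluation over this benchmark looks like, the loop below scores exact-match answers. The dataset ID, config, and field names are assumptions, and predict() is a hypothetical hook for your own model; consult the benchmark's official harness for the real protocol.

```python
# A hedged sketch of exact-match accuracy scoring on a VoiceBench-style
# multiple-choice subset. Dataset ID, config, and field names are assumed.
from datasets import load_dataset

def predict(audio, prompt):
    """Hypothetical hook: call your audio-text-to-text model here."""
    raise NotImplementedError

ds = load_dataset("hlt-lab/voicebench", "openbookqa", split="test")  # assumed ID
correct = 0
for ex in ds:
    answer = predict(ex["audio"], ex["prompt"])  # assumed field names
    correct += int(answer.strip().upper() == ex["reference"].strip().upper())
print(f"accuracy = {correct / len(ds):.3f}")
```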

Top 10

Leading models on VoiceBench.

Rank  Model                                       overall-score  Year  Source
1     Ultravox-GLM-4P7                            88.9           2026  paper
2     Whisper-v3-large + GPT-4o (cascade)         87.8           2026  paper
3     GPT-4o-Audio                                86.8           2026  paper
4     Whisper-v3-large + LLaMA-3.1-8B (cascade)   77.5           2026  paper
5     Kimi-Audio                                  76.9           2026  paper
6     MiniCPM-o                                   71.2           2026  paper
7     VITA-1.5                                    64.5           2026  paper
8     Qwen2-Audio                                 55.8           2026  paper
9     LLaMA-Omni                                  41.1           2026  paper
10    VITA-1.0                                    36.4           2026  paper

All datasets

2 datasets tracked for this task.

Related tasks

Other tasks in Multimodal.

Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.
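
As one concrete route, here is a hedged sketch of running Qwen2-Audio (ranked in the table above) locally with transformers, following the usage pattern from its model card; "clip.wav" is a placeholder, and the exact kwargs should be verified against your installed transformers version.

```python
# A sketch of local inference with Qwen2-Audio via transformers, following
# the pattern documented on the model card. "clip.wav" is a placeholder.
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "clip.wav"},
        {"type": "text", "text": "Summarize what the speaker asks for."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("clip.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=[audio], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
reply = processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(reply)
```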
