Audio-Text-to-Text

Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding coverage, but real-world spoken dialogue understanding remains well beyond what leaderboards capture.
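
For context, the lossy cascade that native audio models replace is easy to sketch. Below is a minimal, illustrative version using the Hugging Face transformers pipeline API; the model IDs and the "sample.wav" path are assumptions for illustration, not a recommendation.

```python
# A minimal sketch of the ASR-then-LLM cascade, assuming Hugging Face
# transformers; model IDs and "sample.wav" are illustrative placeholders.
from transformers import pipeline

# Stage 1: speech -> text. Prosody, overlap, and hesitation are lost here.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
transcript = asr("sample.wav")["text"]

# Stage 2: text -> text. The LLM only ever sees the flattened transcript.
llm = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
messages = [{"role": "user", "content": f'The user said: "{transcript}". Respond helpfully.'}]
print(llm(messages, max_new_tokens=128)[0]["generated_text"][-1]["content"])
```

Everything downstream of stage 1 is blind to how the words were spoken, which is exactly the information the natively multimodal models above retain.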

Datasets: 2 · Results: 18 · Canonical metric: accuracy

Canonical Benchmark

VoiceBench

Comprehensive evaluation benchmark for voice agents (LLM-based speech assistants) measuring instruction following, robustness to accents/noise/content variations, and task performance across diverse scenarios.

Primary metric: accuracy
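
As a sketch of what an accuracy-style evaluation over this benchmark looks like, the loop below scores exact-match answers. The dataset ID, config, and field names are assumptions, and predict() is a hypothetical hook for your own model; consult the benchmark's official harness for the real protocol.

```python
# A hedged sketch of exact-match accuracy scoring on a VoiceBench-style
# multiple-choice subset. Dataset ID, config, and field names are assumed.
from datasets import load_dataset

def predict(audio, prompt):
    """Hypothetical hook: call your audio-text-to-text model here."""
    raise NotImplementedError

ds = load_dataset("hlt-lab/voicebench", "openbookqa", split="test")  # assumed ID
correct = 0
for ex in ds:
    answer = predict(ex["audio"], ex["prompt"])  # assumed field names
    correct += int(answer.strip().upper() == ex["reference"].strip().upper())
print(f"accuracy = {correct / len(ds):.3f}")
```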

Top 10

Leading models on VoiceBench.

Rank  Model                                       overall-score  Year  Source
1     Ultravox-GLM-4P7                            88.9           2026  paper
2     Whisper-v3-large + GPT-4o (cascade)         87.8           2026  paper
3     GPT-4o-Audio                                86.8           2026  paper
4     Whisper-v3-large + LLaMA-3.1-8B (cascade)   77.5           2026  paper
5     Kimi-Audio                                  76.9           2026  paper
6     MiniCPM-o                                   71.2           2026  paper
7     VITA-1.5                                    64.5           2026  paper
8     Qwen2-Audio                                 55.8           2026  paper
9     LLaMA-Omni                                  41.1           2026  paper
10    VITA-1.0                                    36.4           2026  paper

All datasets

2 datasets tracked for this task.

Related tasks

Other tasks in Multimodal.

Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.
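
As one concrete route, here is a hedged sketch of running Qwen2-Audio (ranked in the table above) locally with transformers, following the usage pattern from its model card; "clip.wav" is a placeholder, and the exact kwargs should be verified against your installed transformers version.

```python
# A sketch of local inference with Qwen2-Audio via transformers, following
# the pattern documented on the model card. "clip.wav" is a placeholder.
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "clip.wav"},
        {"type": "text", "text": "Summarize what the speaker asks for."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audio, _ = librosa.load("clip.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=[audio], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
reply = processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(reply)
```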
