Multimodal · audio-text-to-text

Audio-Text-to-Text.

Audio-text-to-text is the backbone of voice assistants that understand context: models that jointly process speech and text to generate grounded responses. Whisper (2022) made robust transcription practical, but the larger shift came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, removing the need for a lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, coping with noisy environments, and preserving prosodic cues such as sarcasm or hesitation, which a transcript alone discards. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding still demands far more than leaderboards capture.

3 datasets · 4 results · Canonical metric: accuracy
§ 02 · Canonical benchmark

The reference dataset.

VoiceBench

Comprehensive evaluation benchmark for voice agents (LLM-based speech assistants) measuring instruction following, robustness to accents/noise/content variations, and task performance across diverse scenarios.

Primary metric: accuracy
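
As a rough illustration of what an accuracy score on a benchmark like this represents, here is a minimal sketch. It assumes exact-match scoring after light normalization and reports a percentage; the function names `normalize` and `accuracy` are hypothetical, and VoiceBench's own scoring scripts may use different matching rules.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the percentage of predictions that match their reference after normalization.

    Assumption: exact match on normalized strings. Real benchmarks may
    instead use option-letter extraction or model-based judging.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)
```

Under these assumptions, a model that answers 2 of 3 items correctly scores about 66.7, which is the same scale as the leaderboard figures reported on this page.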
§ 03 · Top 10

Leading models.

Leading models on VoiceBench.

No results yet. Be the first to contribute.

§ 04 · All datasets

Tracked datasets.

3 datasets tracked for this task.

VoiceBench (canonical) · 0 results · accuracy
MMAU · 4 results · accuracy · Top: Qwen3.5-Omni-Plus 82.2
AudioBench · 0 results · accuracy
§ 05 · Related tasks

Other tasks in Multimodal.

Any-to-Any · Cross-Modal Retrieval · Image Captioning · Image-Text-to-Image · Image-Text-to-Text · Image-Text-to-Video · Text-to-Image Generation · Video Understanding