Text-to-Audio
Text-to-audio generates sound effects, music, and ambient audio from natural language descriptions, a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability AI's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, enabling production-ready sound design from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, though human evaluation still dominates because automated metrics capture subjective quality poorly. The main unsolved challenges are temporal coherence in long-form generation (beyond roughly 30 seconds), precise control over timing and structure, and music that maintains harmonic consistency across full songs.
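FAD, mentioned above, is the Fréchet distance between Gaussians fit to embeddings of reference and generated audio (classically VGGish embeddings). A minimal sketch of the computation, assuming you already have the two embedding matrices; the function name and shapes are illustrative, not a specific library's API:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two embedding sets.

    emb_ref, emb_gen: (n_samples, dim) arrays of audio embeddings,
    e.g. from VGGish; any fixed audio embedder can stand in here.
    """
    mu1, mu2 = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    sigma1 = np.cov(emb_ref, rowvar=False)
    sigma2 = np.cov(emb_gen, rowvar=False)
    diff = mu1 - mu2
    # Matrix square root of the product of the two covariance matrices
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Lower is better: identical embedding distributions give a FAD near zero, and the score grows as the generated audio's embedding statistics drift from the reference set's.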
AudioCaps (T2A)
AudioCaps captions are used as prompts for text-to-audio generation models. It is the standard evaluation set for AudioLDM, AudioGen, and Stable Audio.
Top 10
Leading models on AudioCaps (T2A).
All datasets
1 dataset tracked for this task.
Related tasks
Other tasks in Audio.