Audio-Visual Speech Separation
Separate or enhance speech in videos using both audio and lip cues. Improves meeting transcription, TV/movie captioning, and noisy recordings.
How Audio-Visual Speech Separation Works
A technical deep-dive into audio-visual speech separation. How machines use lip movements to isolate individual voices from a noisy crowd.
The Cocktail Party Problem
Imagine you are at a crowded party. Dozens of conversations overlap, glasses clink, music plays. Yet somehow you can focus on a single voice and follow what that person is saying. How does the brain do this? And can we teach machines to do it too?
The Challenge
Given only the mixed signal, extract Speaker A's voice cleanly. This is source separation.
Why Audio-Only Separation Is Hard
Human voices occupy similar frequency ranges (fundamental frequency roughly 85-255 Hz). When two people speak, their harmonics interleave and mask each other.
The mixture collapses every source into a single waveform. Where components overlap and cancel, information is genuinely lost, so separation requires inferring detail the recording itself no longer contains.
Which output corresponds to which speaker? Without labels, the model cannot know who is who. This is the "permutation ambiguity."
The Key Insight: Humans Use Vision
Watch someone speak at a noisy party. Your brain automatically uses their lip movements to disambiguate their voice from the noise. This ability is known as visual speech perception; the McGurk effect is a striking demonstration of how strongly lip movements shape what we hear. If we can see who is speaking, we can use that visual signal to guide audio separation. The eyes help the ears.
Visual Cues for Speech
The face, especially the mouth region, contains rich information about what someone is saying. Lip reading research shows that visual speech can be decoded even without audio.
Lip Shapes Encode Phonemes
Different vowels and consonants produce distinct mouth shapes. "A" opens wide, "O" rounds, "M" closes.
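Phonemes that share a mouth shape are often grouped into "visemes." The grouping below is a simplified, illustrative sketch, not a standard taxonomy; real systems learn these correspondences from data rather than hard-coding them.

```python
# Illustrative viseme groups: phonemes that look (nearly) identical on the lips.
VISEME_GROUPS = {
    "bilabial":    ["p", "b", "m"],   # lips pressed together
    "labiodental": ["f", "v"],        # lower lip against upper teeth
    "rounded":     ["o", "u", "w"],   # rounded, protruded lips
    "open":        ["a"],             # jaw dropped, mouth wide open
}

# Because each group collapses acoustically distinct phonemes, lip reading alone
# is ambiguous -- which is why vision is used to *guide* audio, not replace it.
```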
What the Model Sees
| Visual Cue | Description | Importance |
|---|---|---|
| Lip Movement | Primary visual signal. Mouth shape correlates with phonemes. | Critical |
| Jaw Motion | Amplitude indicator. Wider jaw = louder speech. | High |
| Facial Landmarks | 68+ points tracking face geometry. | Medium |
| Head Pose | Speaking direction and attention. | Medium |
| Eye Gaze | Turn-taking cues in conversation. | Low |
| Speaker Identity | Face embedding for voice association. | High |
Lip Region Processing Pipeline
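A minimal sketch of that preprocessing, assuming a hypothetical `detect_mouth_center` callable (in practice a landmark detector such as dlib or MediaPipe would supply it) and the 96x96 grayscale crop convention mentioned later on this page:

```python
import cv2
import numpy as np

def extract_lip_crops(frames, detect_mouth_center, crop_size=96, half_width=48):
    """
    Turn raw video frames into a (T, 96, 96) stack of grayscale lip crops.

    Args:
        frames: list of HxWx3 BGR frames (e.g., read with cv2.VideoCapture).
        detect_mouth_center: hypothetical callable returning the (x, y) mouth
            center for a frame -- stands in for a real landmark detector.
    """
    crops = []
    for frame in frames:
        x, y = detect_mouth_center(frame)

        # Crop a square region around the mouth, clamped to the image bounds
        h, w = frame.shape[:2]
        x0, y0 = max(0, int(x) - half_width), max(0, int(y) - half_width)
        x1, y1 = min(w, int(x) + half_width), min(h, int(y) + half_width)
        mouth = frame[y0:y1, x0:x1]

        # Grayscale + fixed size, matching the 96x96 convention
        mouth = cv2.cvtColor(mouth, cv2.COLOR_BGR2GRAY)
        mouth = cv2.resize(mouth, (crop_size, crop_size))
        crops.append(mouth.astype(np.float32) / 255.0)

    return np.stack(crops)  # (T, 96, 96), one crop per video frame
```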
How the Separation Works
1. The model receives the mixed audio spectrogram and video of all visible faces.
2. Lip movements are encoded into a sequence of visual features (one per video frame).
3. Audio and visual features are fused, learning which sounds correlate with which lip movements.
4. The model predicts a mask for each speaker: which time-frequency bins belong to them.
5. Applying the mask to the mixed spectrogram isolates each speaker's voice (see the sketch below).
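Steps 4-5 reduce to element-wise masking of the mixture's STFT. A minimal sketch for one speaker, assuming a real-valued ratio mask and reusing the mixture's phase for reconstruction (parameters follow the 25 ms window / 10 ms hop convention at 16 kHz):

```python
import torch

def apply_mask_and_reconstruct(mixture_wav, mask):
    """
    mixture_wav: (num_samples,) mixed waveform at 16 kHz
    mask:        (F, T) ratio mask in [0, 1] predicted for one speaker
    """
    # 25 ms window (400 samples), 10 ms hop (160 samples), zero-padded to
    # n_fft=512 -> F = 257 frequency bins
    window = torch.hann_window(400)
    spec = torch.stft(mixture_wav, n_fft=512, hop_length=160, win_length=400,
                      window=window, return_complex=True)   # (257, T), complex

    # Scaling the complex bins by a real mask suppresses the other speaker's
    # energy while keeping the mixture's phase (step 5)
    masked = spec * mask

    # Inverse STFT back to a time-domain waveform for the isolated speaker
    return torch.istft(masked, n_fft=512, hop_length=160, win_length=400,
                       window=window, length=mixture_wav.shape[-1])
```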
Multimodal Fusion Strategies
How do you combine audio and video? The fusion strategy determines how the modalities interact.
- Early fusion: concatenate audio and video features before processing.
- Late fusion: process each modality separately and combine the predictions.
- Attention-based fusion: learn to attend across modalities dynamically.
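A minimal sketch of the attention-based option: audio frames attend over lip features that have already been brought to the same frame rate. Layer sizes are illustrative, not taken from any specific paper.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Audio frames attend over lip-movement features (illustrative sizes)."""

    def __init__(self, audio_dim=256, visual_dim=512, n_heads=4):
        super().__init__()
        # Project lip features into the audio feature space so dimensions match
        self.visual_proj = nn.Linear(visual_dim, audio_dim)
        self.attn = nn.MultiheadAttention(audio_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (B, T, audio_dim)   - one vector per spectrogram frame
        # visual_feats: (B, T, visual_dim)  - one vector per (upsampled) video frame
        v = self.visual_proj(visual_feats)

        # Audio queries attend to visual keys/values: each audio frame pulls in
        # the lip information that best explains it
        attended, _ = self.attn(query=audio_feats, key=v, value=v)

        # Residual connection keeps the audio path intact if vision is uninformative
        return self.norm(audio_feats + attended)
```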
Typical AV Speech Separation Architecture
- Spectrogram resolution: 25 ms window, 10 ms hop, for AV sync with 25 fps video
- Lip crop size: typically 96x96 or 112x112 grayscale
- Mask type: ratio mask (0-1) or complex ideal ratio mask (cIRM)
- Loss function: SI-SNR (scale-invariant SNR) or spectrogram L1
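A minimal SI-SNR implementation to go with the loss entry above; negating it gives a training loss. The mean removal and projection onto the target are what make the measure invariant to overall scale.

```python
import torch

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB. Inputs: (..., num_samples) waveforms."""
    # Remove DC offset so the measure ignores constant shifts
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)

    # Project the estimate onto the target to find the scaled reference
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)

    # Everything not explained by the (scaled) target counts as noise
    e_noise = estimate - s_target
    ratio = s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

def si_snr_loss(estimate, target):
    # Maximize SI-SNR by minimizing its negation
    return -si_snr(estimate, target).mean()
```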
Key Research Methods
From the pioneering Looking to Listen to self-supervised AV-HuBERT.
- Looking to Listen (Google, 2018): the first large-scale AV speech separation system; uses face embeddings + dilated convolutions.
- VisualVoice (MIT, 2021): uses lip-reading features + speaker identity; better generalization to unseen speakers.
- AV-HuBERT (Meta, 2022): learns AV representations from unlabeled video; state-of-the-art on lip reading.
- Extending HuBERT to joint AV learning yields unified representations for multiple tasks.
Benchmark Datasets
| Dataset | Size | Content | Use Case |
|---|---|---|---|
| LRS2 | 225 hours | BBC news, lectures | Lip reading, AV ASR |
| LRS3 | 438 hours | TED talks | Large-scale AV training |
| VoxCeleb2 | 2000+ hours | Celebrity interviews | Speaker recognition, separation |
| AVSpeech | 2800+ hours | YouTube videos | Looking to Listen training |
State-of-the-Art Results (SDRi on LRS2)
SDRi = Signal-to-Distortion Ratio improvement over the mixture. Higher is better. Visual information adds 2-6 dB over audio-only methods.
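In code, the improvement is simply the metric on the estimate minus the same metric on the raw mixture. The sketch below substitutes SI-SNR (reusing the si_snr helper sketched above) for SDR, a common stand-in in recent papers.

```python
def si_snr_improvement(estimate, reference, mixture):
    # How much better the separated estimate is than just listening to the mixture
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```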
Code Examples
Implementation snippets for lip detection, AV separation models, and evaluation.
```python
import torch
import torch.nn as nn
import torchaudio


class AudioVisualSeparator(nn.Module):
    """
    Simplified Looking to Listen architecture.

    The model takes:
      - Mixed audio spectrogram: (B, 1, F, T)
      - Face embeddings for each speaker: (B, N_speakers, D_face)

    Outputs:
      - Separated spectrograms for each speaker: (B, N_speakers, F, T)
    """

    def __init__(self, n_speakers=2, face_dim=512, audio_channels=257):
        super().__init__()

        # Face encoder (pretrained, e.g., FaceNet)
        self.face_encoder = nn.Sequential(
            nn.Linear(face_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256)
        )

        # Audio encoder with dilated convolutions
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=4, dilation=4),
            nn.BatchNorm2d(128),
            nn.ReLU()
        )

        # Audio-visual fusion
        self.fusion = nn.Sequential(
            nn.Conv2d(128 + 256, 256, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(256, 128, kernel_size=1),
            nn.ReLU()
        )

        # Mask prediction (one per speaker)
        self.mask_predictor = nn.Sequential(
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, n_speakers, kernel_size=1),
            nn.Sigmoid()
        )

    def forward(self, mixed_audio, face_embeddings):
        """
        Args:
            mixed_audio: (B, 1, F, T) - mixed spectrogram
            face_embeddings: (B, N, D) - face embedding per speaker

        Returns:
            masks: (B, N, F, T) - separation mask per speaker
        """
        B, _, F, T = mixed_audio.shape
        N = face_embeddings.shape[1]

        # Encode audio
        audio_features = self.audio_encoder(mixed_audio)  # (B, 128, F, T)

        # Encode faces and tile across time/frequency
        face_features = self.face_encoder(face_embeddings)  # (B, N, 256)

        # For simplicity, use first speaker's face for conditioning.
        # Full model would predict separate masks for each speaker.
        face_tiled = face_features[:, 0, :].unsqueeze(-1).unsqueeze(-1)
        face_tiled = face_tiled.expand(-1, -1, F, T)  # (B, 256, F, T)

        # Fuse modalities
        fused = torch.cat([audio_features, face_tiled], dim=1)
        fused = self.fusion(fused)

        # Predict masks
        masks = self.mask_predictor(fused)  # (B, N, F, T)
        return masks

    def separate(self, mixed_audio, face_embeddings):
        """Apply masks to separate audio."""
        masks = self.forward(mixed_audio, face_embeddings)
        # Apply masks to mixed spectrogram
        separated = mixed_audio * masks  # (B, N, F, T)
        return separated, masks
```
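A quick shape check of the class above with random tensors (a smoke test only; real inputs would be STFT magnitudes and embeddings from a pretrained face recognizer):

```python
model = AudioVisualSeparator(n_speakers=2, face_dim=512)

mixed_spec = torch.rand(4, 1, 257, 200)   # (B, 1, F, T): ~2 s of 10 ms frames
face_emb = torch.rand(4, 2, 512)          # (B, N_speakers, D_face)

separated, masks = model.separate(mixed_spec, face_emb)
print(separated.shape)                     # torch.Size([4, 2, 257, 200])
```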
Common Implementation Pitfalls
Pitfalls:
- AV sync: 25 fps video and 16 kHz audio must stay aligned
- Face detection failures in low light
- Lip crop jitter between frames
- Missing face means falling back to audio-only separation
Mitigations:
- Use face tracking, not per-frame detection
- Normalize lip crop position across frames
- Pretrain the face encoder on a lip reading task
- Data augmentation: noise, reverberation
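For the AV sync point above, a common fix is to upsample the 25 fps lip features to the spectrogram frame rate (100 fps with a 10 ms hop) before fusion. A minimal sketch using linear interpolation:

```python
import torch
import torch.nn.functional as F

def align_visual_to_audio(visual_feats, n_audio_frames):
    """
    visual_feats:   (B, T_video, D) lip features at 25 fps
    n_audio_frames: number of spectrogram frames (10 ms hop -> 100 fps),
                    roughly 4x the number of video frames
    """
    # F.interpolate expects (B, C, T), so move features to the channel axis
    x = visual_feats.transpose(1, 2)
    x = F.interpolate(x, size=n_audio_frames, mode="linear", align_corners=False)
    return x.transpose(1, 2)  # (B, n_audio_frames, D)
```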
Quick Reference
Key ideas:
- Lip movements reveal what is spoken
- Visual cues disambiguate mixed audio
- Conditioning on a face resolves the permutation problem
Landmark methods:
- Looking to Listen (Google, 2018)
- VisualVoice (MIT, 2021)
- AV-HuBERT (Meta, 2022)
Metrics:
- SDRi: separation quality
- PESQ: perceptual quality
- STOI: intelligibility
Use Cases
- ✓ Meeting transcription cleanup
- ✓ Broadcast captioning
- ✓ Noisy lecture enhancement
- ✓ Courtroom/boardroom audio
Architectural Patterns
Audio-Visual Masking
Use lip regions to condition separation masks on the audio spectrogram.
Speaker Localization + Separation
Track faces, map to speakers, then isolate corresponding audio streams.
Implementations
Open Source
SpeechSplit + Visual Conditioning
Apache 2.0. Pair SepFormer with face tracks for better diarization.
NeMo AV-Diarization
Apache 2.0. Pipelines that fuse VAD, diarization, and separation with vision cues.
Quick Facts
- Input: Video
- Output: Audio
- Implementations: 3 open source, 0 API
- Patterns: 2 approaches