
Audio-Visual Speech Separation

Separate or enhance speech in videos using both audio and lip cues. Improves meeting transcription, TV/movie captioning, and noisy recordings.

How Audio-Visual Speech Separation Works

A technical deep-dive into audio-visual speech separation. How machines use lip movements to isolate individual voices from a noisy crowd.

1. The Cocktail Party Problem

Imagine you are at a crowded party. Dozens of conversations overlap, glasses clink, music plays. Yet somehow you can focus on one voice and follow what they are saying. How does the brain do this? And can we teach machines to do it too?

The Challenge

Speaker A + Speaker B + Speaker C + Background Noise (music, traffic, HVAC) = Mixed Signal (everything overlapped)

Given only the mixed signal, extract Speaker A's voice cleanly. This is source separation.

Why Audio-Only Separation Is Hard

Spectral Overlap

Human voices occupy similar frequency ranges (85-255 Hz fundamental). When two people speak, their harmonics interleave and mask each other.

No Ground Truth

The mixing is destructive. Information is genuinely lost when waves interfere. Separation requires inferring what was never recorded.

Permutation Problem

Which output corresponds to which speaker? Without labels, the model cannot know who is who. This is the "permutation ambiguity."
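
For contrast with the visual approach introduced next, here is a minimal sketch of how audio-only systems typically cope with this ambiguity: try every speaker ordering and keep the best match (permutation-invariant training). The shapes and the MSE criterion are illustrative assumptions.

import itertools
import torch

def permutation_invariant_loss(estimates, targets):
    """Illustrates the permutation ambiguity: without labels we must try
    every speaker ordering and keep the best-matching one.
    estimates, targets: (B, N_speakers, T) waveforms."""
    n = estimates.shape[1]
    best = None
    for perm in itertools.permutations(range(n)):
        # Mean-squared error under this particular speaker ordering
        loss = torch.mean((estimates[:, list(perm)] - targets) ** 2)
        best = loss if best is None else torch.minimum(best, loss)
    return best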

The Key Insight: Humans Use Vision

Watch someone speak at a noisy party. Your brain automatically uses their lip movements to disambiguate their voice from the noise. This is visual speech perception; the McGurk effect is a striking demonstration of how strongly lip movements shape what we hear. If we can see who is speaking, we can use that visual signal to guide audio separation. The eyes help the ears.

2. Visual Cues for Speech

The face, especially the mouth region, contains rich information about what someone is saying. Lip reading research shows that visual speech can be decoded even without audio.

Lip Shapes Encode Phonemes

A (open), E (wide), I (narrow), O (round), U (pursed), M (closed)

Different vowels and consonants produce distinct mouth shapes. "A" opens wide, "O" rounds, "M" closes.
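
To make this concrete, a toy phoneme-to-viseme lookup is sketched below; the groupings and class names are illustrative, not a standard inventory.

# Illustrative phoneme -> viseme grouping: sounds that look alike on the
# lips share a class. ARPAbet symbols; class names chosen for this example.
PHONEME_TO_VISEME = {
    "AA": "open",      # "a" as in father: jaw drops, mouth opens wide
    "IY": "narrow",    # "ee" as in see: lips spread, small opening
    "OW": "round",     # "o" as in go: lips rounded
    "UW": "pursed",    # "oo" as in boot: lips pursed forward
    "M": "closed",     # bilabials m/b/p: lips fully closed
    "B": "closed",
    "P": "closed",
    "F": "lip-teeth",  # f/v: lower lip touches upper teeth
    "V": "lip-teeth",
}

def viseme(phoneme: str) -> str:
    return PHONEME_TO_VISEME.get(phoneme.upper(), "unknown")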

What the Model Sees

Visual Cue | Description | Importance
Lip Movement | Primary visual signal; mouth shape correlates with phonemes. | Critical
Jaw Motion | Amplitude indicator; wider jaw = louder speech. | High
Facial Landmarks | 68+ points tracking face geometry. | Medium
Head Pose | Speaking direction and attention. | Medium
Eye Gaze | Turn-taking cues in conversation. | Low
Speaker Identity | Face embedding for voice association. | High

Lip Region Processing Pipeline

Video frame (full face) -> Face detection (bounding box) -> Landmarks (68 points) -> Lip crop (96x96 region) -> CNN/Transformer (visual features)
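
A minimal sketch of the crop stage using OpenCV (assuming opencv-python is installed); it substitutes a crude "lower half of the face box" for real 68-point landmarks, which in practice come from a dedicated landmark tracker.

import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def extract_lip_roi(frame_bgr, size=96):
    """Return a size x size grayscale lip crop, or None if no face is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                        # take the first detected face
    mouth = gray[y + h // 2 : y + h, x : x + w]  # lower half approximates the mouth region
    return cv2.resize(mouth, (size, size))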

3. Interactive: Separating Speakers

See how visual cues help isolate individual voices from a mixture.

(Interactive demo: choose a speaker to focus on. The video panel shows Speakers A and B over time; the audio panel lets you compare the mixed signal with Speaker A (separated) and Speaker B (separated).)

How the Separation Works

  1. The model receives the mixed audio spectrogram and video of all visible faces.
  2. Lip movements are encoded into a sequence of visual features (one per video frame).
  3. Audio and visual features are fused, so the model learns which sounds correlate with which lip movements.
  4. The model predicts a mask for each speaker: which time-frequency bins belong to them.
  5. Applying the mask to the mixed spectrogram isolates each speaker's voice (steps 4-5 are sketched in code below).
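
A minimal sketch of steps 4-5, assuming 16 kHz audio and a mask predictor that already exists: mask the mixture spectrogram per speaker, then invert back to waveforms with PyTorch's STFT.

import torch

def apply_masks(mixture, masks, n_fft=512, hop=160, win=400):
    """mixture: (T_samples,) mono waveform at 16 kHz.
    masks: (N_speakers, F, T) real-valued masks in [0, 1]."""
    window = torch.hann_window(win)
    spec = torch.stft(mixture, n_fft=n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)   # (F, T) complex
    separated = []
    for mask in masks:
        masked = spec * mask                                 # keep this speaker's bins
        wav = torch.istft(masked, n_fft=n_fft, hop_length=hop,
                          win_length=win, window=window)
        separated.append(wav)
    return torch.stack(separated)                            # (N_speakers, T_samples)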

4. Multimodal Fusion Strategies

How do you combine audio and video? The fusion strategy determines how the modalities interact.

Early Fusion

Concatenate audio and video features before processing.

+ Simple; learns joint representations
- Modalities must be aligned; loses modality-specific information
Example: concat spectrogram + lip features -> shared network

Late Fusion

Process modalities separately, then combine their predictions.

+ Modality-specific processing; robust to a missing modality
- May miss cross-modal interactions
Example: audio mask + video mask -> weighted sum

Attention Fusion

Learn to attend across modalities dynamically.

+ Flexible; handles variable reliability of each modality
- More parameters; harder to train
Example: cross-modal transformer attention
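
Minimal sketches of the three strategies on time-aligned per-frame features; the shapes, dimensions, and the residual connection are illustrative assumptions.

import torch
import torch.nn as nn

# audio_feats: (B, T, Da), video_feats: (B, T, Dv), aligned in time.

def early_fusion(audio_feats, video_feats):
    # Concatenate along the feature dimension before any joint processing.
    return torch.cat([audio_feats, video_feats], dim=-1)       # (B, T, Da + Dv)

def late_fusion(audio_mask, video_mask, w=0.5):
    # Combine per-modality mask predictions; robust if one modality drops out.
    return w * audio_mask + (1.0 - w) * video_mask

class AttentionFusion(nn.Module):
    # Audio frames attend to video frames (cross-modal attention).
    def __init__(self, d_audio=256, d_video=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d_audio, num_heads=n_heads,
                                          kdim=d_video, vdim=d_video,
                                          batch_first=True)

    def forward(self, audio_feats, video_feats):
        fused, _ = self.attn(query=audio_feats, key=video_feats, value=video_feats)
        return fused + audio_feats                              # residual connection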

Typical AV Speech Separation Architecture

Input Streams
  • Audio stream: mixed waveform -> STFT -> spectrogram (257 x T)
  • Video stream (per speaker): face crop -> lip ROI -> CNN -> features (T x D)
Processing
  • Audio-visual fusion: concatenation, attention, or FiLM conditioning
  • Mask estimation: U-Net / BLSTM predicts a T-F mask per speaker
Output
  • Mask x mixed spectrogram -> iSTFT -> separated waveform

Key Design Choices:
  • Spectrogram resolution: 25 ms window, 10 ms hop for AV sync at 25 fps video
  • Lip crop size: typically 96x96 or 112x112 grayscale
  • Mask type: ratio mask (0-1) or complex ideal ratio mask (cIRM)
  • Loss function: SI-SNR (scale-invariant SNR) or spectrogram L1
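
A minimal sketch of the SI-SNR objective mentioned above (assuming time-domain estimates and targets of equal length); training typically minimizes the negative SI-SNR.

import torch

def si_snr(estimate, target, eps=1e-8):
    """estimate, target: (B, T) waveforms. Returns SI-SNR in dB per example."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)   # zero-mean both signals
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target (removes scale differences)
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    s_target = dot * target / (torch.sum(target ** 2, dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

def si_snr_loss(estimate, target):
    return -si_snr(estimate, target).mean()   # negate: higher SI-SNR is better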

5. Key Research Methods

From the pioneering Looking to Listen to self-supervised AV-HuBERT.

Looking to Listen
Google (Ephrat et al.) (2018)
Audio-visual

First large-scale AV speech separation. Uses face embeddings + dilated convolutions.

Architecture: CNN face encoder + audio spectrogram network + fusion
Dataset: AVSpeech (2800+ hours)
VisualVoice
UT Austin / Facebook AI (Gao & Grauman) (2021)
Cross-modal

Uses lip reading features + speaker identity. Better generalization to unseen speakers.

Architecture: Lip reading encoder + speaker embedding + U-Net separator
Dataset: VoxCeleb2, LRS2
AV-HuBERT
Meta (Shi et al.) (2022)
Self-supervised

Learns AV representations from unlabeled video. State-of-the-art on lip reading.

Architecture: Transformer with audio + video streams, masked prediction
Dataset: LRS3, VoxCeleb2
Audio-Visual HuBERT
Meta (Hsu et al.) (2023)
Foundation model

Extends HuBERT to joint AV learning. Unified representations for multiple tasks.

Architecture: Shared transformer backbone, modality-specific heads
Dataset: LRS3, AVSpeech, VoxCeleb

Benchmark Datasets

Dataset | Size | Content | Use Case
LRS2 | 225 hours | BBC news, lectures | Lip reading, AV ASR
LRS3 | 438 hours | TED talks | Large-scale AV training
VoxCeleb2 | 2000+ hours | Celebrity interviews | Speaker recognition, separation
AVSpeech | 2800+ hours | YouTube videos | Looking to Listen training

State-of-the-Art Results (SDRi on LRS2)

Audio-only: ~8 dB
Looking to Listen: ~10.8 dB
VisualVoice: ~12.1 dB
AV-HuBERT: ~14.3 dB

SDRi = Signal-to-Distortion Ratio improvement over the mixture. Higher is better. Visual information adds 2-6 dB over audio-only methods.
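
A simplified way to compute an SDR improvement for a single reference, assuming a plain SNR-style SDR rather than the full BSS Eval procedure that published numbers use.

import torch

def sdr(estimate, reference, eps=1e-8):
    # Ratio of reference energy to residual error energy, in dB.
    num = torch.sum(reference ** 2, dim=-1)
    den = torch.sum((reference - estimate) ** 2, dim=-1) + eps
    return 10 * torch.log10(num / den + eps)

def sdr_improvement(estimate, reference, mixture):
    """How much the separated signal improves over just keeping the mixture."""
    return sdr(estimate, reference) - sdr(mixture, reference)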

6. Code Examples

Implementation snippets for lip detection, AV separation models, and evaluation.

Audio-Visual Separator Model (PyTorch)
Install: pip install torch torchaudio
import torch
import torch.nn as nn
import torchaudio

class AudioVisualSeparator(nn.Module):
    """
    Simplified Looking to Listen architecture.

    The model takes:
    - Mixed audio spectrogram: (B, 1, F, T)
    - Face embeddings for each speaker: (B, N_speakers, D_face)

    Outputs:
    - Separated spectrograms for each speaker: (B, N_speakers, F, T)
    """

    def __init__(self, n_speakers=2, face_dim=512, audio_channels=257):
        super().__init__()

        # Face encoder (pretrained, e.g., FaceNet)
        self.face_encoder = nn.Sequential(
            nn.Linear(face_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256)
        )

        # Audio encoder with dilated convolutions
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=4, dilation=4),
            nn.BatchNorm2d(128),
            nn.ReLU()
        )

        # Audio-visual fusion
        self.fusion = nn.Sequential(
            nn.Conv2d(128 + 256, 256, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(256, 128, kernel_size=1),
            nn.ReLU()
        )

        # Mask prediction (one per speaker)
        self.mask_predictor = nn.Sequential(
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, n_speakers, kernel_size=1),
            nn.Sigmoid()
        )

    def forward(self, mixed_audio, face_embeddings):
        """
        Args:
            mixed_audio: (B, 1, F, T) - mixed spectrogram
            face_embeddings: (B, N, D) - face embedding per speaker

        Returns:
            masks: (B, N, F, T) - separation mask per speaker
        """
        B, _, F, T = mixed_audio.shape
        N = face_embeddings.shape[1]

        # Encode audio
        audio_features = self.audio_encoder(mixed_audio)  # (B, 128, F, T)

        # Encode faces and tile across time/frequency
        face_features = self.face_encoder(face_embeddings)  # (B, N, 256)

        # For simplicity, use first speaker's face for conditioning
        # Full model would predict separate masks for each speaker
        face_tiled = face_features[:, 0, :].unsqueeze(-1).unsqueeze(-1)
        face_tiled = face_tiled.expand(-1, -1, F, T)  # (B, 256, F, T)

        # Fuse modalities
        fused = torch.cat([audio_features, face_tiled], dim=1)
        fused = self.fusion(fused)

        # Predict masks
        masks = self.mask_predictor(fused)  # (B, N, F, T)

        return masks

    def separate(self, mixed_audio, face_embeddings):
        """Apply masks to separate audio."""
        masks = self.forward(mixed_audio, face_embeddings)

        # Apply masks to mixed spectrogram
        separated = mixed_audio * masks  # (B, N, F, T)

        return separated, masks
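
A quick shape check of the sketch above with random tensors (the sizes are illustrative):

if __name__ == "__main__":
    model = AudioVisualSeparator(n_speakers=2, face_dim=512, audio_channels=257)
    mixed = torch.rand(4, 1, 257, 100)    # (B, 1, F, T) magnitude spectrogram
    faces = torch.randn(4, 2, 512)        # (B, N_speakers, D_face)
    separated, masks = model.separate(mixed, faces)
    print(separated.shape, masks.shape)   # both torch.Size([4, 2, 257, 100])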

Common Implementation Pitfalls

Watch Out For:
  • AV sync: 25 fps video and 16 kHz audio must be aligned (see the alignment sketch below)
  • Face detection failures in low light
  • Lip crop jitter between frames
  • Missing face = fall back to audio-only
Best Practices:
  • Use face tracking, not per-frame detection
  • Normalize lip crop position across frames
  • Pretrain the face encoder on a lip reading task
  • Data augmentation: noise, reverberation
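
A minimal alignment sketch for the AV sync pitfall above, assuming 25 fps visual features and a 10 ms audio hop (100 audio frames per second, i.e. 4 per video frame):

import torch

def align_video_to_audio(video_feats, audio_frames_per_video_frame=4):
    """video_feats: (T_video, D). Repeat each visual frame so the sequence
    lines up with the audio frame rate -> (T_video * 4, D)."""
    return torch.repeat_interleave(video_feats, audio_frames_per_video_frame, dim=0)

def trim_to_match(video_feats, audio_feats):
    """Crop both streams to the shorter length after upsampling the video."""
    v = align_video_to_audio(video_feats)
    t = min(v.shape[0], audio_feats.shape[0])
    return v[:t], audio_feats[:t]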

Quick Reference

The Core Idea
  • Lip movements reveal what is being spoken
  • Visual cues disambiguate mixed audio
  • Visual conditioning resolves the permutation problem
Key Methods
  • Looking to Listen (Google, 2018)
  • VisualVoice (UT Austin / Facebook AI, 2021)
  • AV-HuBERT (Meta, 2022)
Metrics
  • SDRi: separation quality
  • PESQ: perceptual quality
  • STOI: intelligibility

Use Cases

  • Meeting transcription cleanup
  • Broadcast captioning
  • Noisy lecture enhancement
  • Courtroom/boardroom audio

Architectural Patterns

Audio-Visual Masking

Use lip regions to condition separation masks on the audio spectrogram.

Speaker Localization + Separation

Track faces, map to speakers, then isolate corresponding audio streams.

Implementations

Open Source

AV-Separation (MS3)

MIT
Open Source

Audio-visual speech separation reference implementation.

SpeechSplit + Visual Conditioning

Apache 2.0
Open Source

Pair SepFormer with face tracks for better diarization.

NeMo AV-Diarization

Apache 2.0
Open Source

Pipelines that fuse VAD, diarization, and separation with vision cues.

Benchmarks

Quick Facts

Input: Video
Output: Audio
Implementations: 3 open source, 0 API
Patterns: 2 approaches

Have benchmark data?

Help us track the state of the art for audio-visual speech separation.

Submit Results