
Audio-Visual Speech Separation

Separate or enhance speech in videos using both audio and lip cues. Improves meeting transcription, TV/movie captioning, and noisy recordings.

How Audio-Visual Speech Separation Works

A technical deep-dive into audio-visual speech separation. How machines use lip movements to isolate individual voices from a noisy crowd.

1. The Cocktail Party Problem

Imagine you are at a crowded party. Dozens of conversations overlap, glasses clink, music plays. Yet somehow you can focus on one voice and follow what they are saying. How does the brain do this? And can we teach machines to do it too?

The Challenge

Speaker A + Speaker B + Speaker C + Background Noise (music, traffic, HVAC) = Mixed Signal (everything overlapped)

Given only the mixed signal, extract Speaker A's voice cleanly. This is source separation.

Why Audio-Only Separation Is Hard

Spectral Overlap

Human voices occupy similar frequency ranges (85-255 Hz fundamental). When two people speak, their harmonics interleave and mask each other.

No Ground Truth

The mixing is destructive. Information is genuinely lost when waves interfere. Separation requires inferring what was never recorded.

Permutation Problem

Which output corresponds to which speaker? Without labels, the model cannot know who is who. This is the "permutation ambiguity."
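
For contrast with the visual approach introduced next, here is a minimal sketch of how audio-only systems typically cope with this ambiguity: try every speaker ordering and keep the best match (permutation-invariant training). The shapes and the MSE criterion are illustrative assumptions.

import itertools
import torch

def permutation_invariant_loss(estimates, targets):
    """Illustrates the permutation ambiguity: without labels we must try
    every speaker ordering and keep the best-matching one.
    estimates, targets: (B, N_speakers, T) waveforms."""
    n = estimates.shape[1]
    best = None
    for perm in itertools.permutations(range(n)):
        # Mean-squared error under this particular speaker ordering
        loss = torch.mean((estimates[:, list(perm)] - targets) ** 2)
        best = loss if best is None else torch.minimum(best, loss)
    return best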

The Key Insight: Humans Use Vision

Watch someone speak at a noisy party. Your brain automatically uses their lip movements to disambiguate their voice from the noise. This is visual speech perception; the McGurk effect is a striking demonstration of how strongly lip movements shape what we hear. If we can see who is speaking, we can use that visual signal to guide audio separation. The eyes help the ears.

2. Visual Cues for Speech

The face, especially the mouth region, contains rich information about what someone is saying. Lip reading research shows that visual speech can be decoded even without audio.

Lip Shapes Encode Phonemes

A (open), E (wide), I (narrow), O (round), U (pursed), M (closed)

Different vowels and consonants produce distinct mouth shapes. "A" opens wide, "O" rounds, "M" closes.
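
To make this concrete, a toy phoneme-to-viseme lookup is sketched below; the groupings and class names are illustrative, not a standard inventory.

# Illustrative phoneme -> viseme grouping: sounds that look alike on the
# lips share a class. ARPAbet symbols; class names chosen for this example.
PHONEME_TO_VISEME = {
    "AA": "open",      # "a" as in father: jaw drops, mouth opens wide
    "IY": "narrow",    # "ee" as in see: lips spread, small opening
    "OW": "round",     # "o" as in go: lips rounded
    "UW": "pursed",    # "oo" as in boot: lips pursed forward
    "M": "closed",     # bilabials m/b/p: lips fully closed
    "B": "closed",
    "P": "closed",
    "F": "lip-teeth",  # f/v: lower lip touches upper teeth
    "V": "lip-teeth",
}

def viseme(phoneme: str) -> str:
    return PHONEME_TO_VISEME.get(phoneme.upper(), "unknown")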

What the Model Sees

Visual Cue | Description | Importance
Lip Movement | Primary visual signal; mouth shape correlates with phonemes. | Critical
Jaw Motion | Amplitude indicator; wider jaw = louder speech. | High
Facial Landmarks | 68+ points tracking face geometry. | Medium
Head Pose | Speaking direction and attention. | Medium
Eye Gaze | Turn-taking cues in conversation. | Low
Speaker Identity | Face embedding for voice association. | High

Lip Region Processing Pipeline

Video frame (full face) -> Face detection (bounding box) -> Landmarks (68 points) -> Lip crop (96x96 region) -> CNN/Transformer (visual features)
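
A minimal sketch of the crop stage using OpenCV (assuming opencv-python is installed); it substitutes a crude "lower half of the face box" for real 68-point landmarks, which in practice come from a dedicated landmark tracker.

import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def extract_lip_roi(frame_bgr, size=96):
    """Return a size x size grayscale lip crop, or None if no face is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                        # take the first detected face
    mouth = gray[y + h // 2 : y + h, x : x + w]  # lower half approximates the mouth region
    return cv2.resize(mouth, (size, size))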

3. Interactive: Separating Speakers

See how visual cues help isolate individual voices from a mixture.

(Interactive demo: choose a speaker to focus on. The video panel shows Speakers A and B over time; the audio panel lets you compare the mixed signal with Speaker A (separated) and Speaker B (separated).)

How the Separation Works

  1. The model receives the mixed audio spectrogram and video of all visible faces.
  2. Lip movements are encoded into a sequence of visual features (one per video frame).
  3. Audio and visual features are fused, so the model learns which sounds correlate with which lip movements.
  4. The model predicts a mask for each speaker: which time-frequency bins belong to them.
  5. Applying the mask to the mixed spectrogram isolates each speaker's voice (steps 4-5 are sketched in code below).
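
A minimal sketch of steps 4-5, assuming 16 kHz audio and a mask predictor that already exists: mask the mixture spectrogram per speaker, then invert back to waveforms with PyTorch's STFT.

import torch

def apply_masks(mixture, masks, n_fft=512, hop=160, win=400):
    """mixture: (T_samples,) mono waveform at 16 kHz.
    masks: (N_speakers, F, T) real-valued masks in [0, 1]."""
    window = torch.hann_window(win)
    spec = torch.stft(mixture, n_fft=n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)   # (F, T) complex
    separated = []
    for mask in masks:
        masked = spec * mask                                 # keep this speaker's bins
        wav = torch.istft(masked, n_fft=n_fft, hop_length=hop,
                          win_length=win, window=window)
        separated.append(wav)
    return torch.stack(separated)                            # (N_speakers, T_samples)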

4. Multimodal Fusion Strategies

How do you combine audio and video? The fusion strategy determines how the modalities interact.

Early Fusion

Concatenate audio and video features before processing.

+ Simple; learns joint representations
- Modalities must be aligned; loses modality-specific information
Example: concat spectrogram + lip features -> shared network

Late Fusion

Process modalities separately, then combine their predictions.

+ Modality-specific processing; robust to a missing modality
- May miss cross-modal interactions
Example: audio mask + video mask -> weighted sum

Attention Fusion

Learn to attend across modalities dynamically.

+ Flexible; handles variable reliability of each modality
- More parameters; harder to train
Example: cross-modal transformer attention
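
Minimal sketches of the three strategies on time-aligned per-frame features; the shapes, dimensions, and the residual connection are illustrative assumptions.

import torch
import torch.nn as nn

# audio_feats: (B, T, Da), video_feats: (B, T, Dv), aligned in time.

def early_fusion(audio_feats, video_feats):
    # Concatenate along the feature dimension before any joint processing.
    return torch.cat([audio_feats, video_feats], dim=-1)       # (B, T, Da + Dv)

def late_fusion(audio_mask, video_mask, w=0.5):
    # Combine per-modality mask predictions; robust if one modality drops out.
    return w * audio_mask + (1.0 - w) * video_mask

class AttentionFusion(nn.Module):
    # Audio frames attend to video frames (cross-modal attention).
    def __init__(self, d_audio=256, d_video=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d_audio, num_heads=n_heads,
                                          kdim=d_video, vdim=d_video,
                                          batch_first=True)

    def forward(self, audio_feats, video_feats):
        fused, _ = self.attn(query=audio_feats, key=video_feats, value=video_feats)
        return fused + audio_feats                              # residual connection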

Typical AV Speech Separation Architecture

Input Streams
  • Audio stream: mixed waveform -> STFT -> spectrogram (257 x T)
  • Video stream (per speaker): face crop -> lip ROI -> CNN -> features (T x D)
Processing
  • Audio-visual fusion: concatenation, attention, or FiLM conditioning
  • Mask estimation: U-Net / BLSTM predicts a T-F mask per speaker
Output
  • Mask x mixed spectrogram -> iSTFT -> separated waveform

Key Design Choices:
  • Spectrogram resolution: 25 ms window, 10 ms hop for AV sync at 25 fps video
  • Lip crop size: typically 96x96 or 112x112 grayscale
  • Mask type: ratio mask (0-1) or complex ideal ratio mask (cIRM)
  • Loss function: SI-SNR (scale-invariant SNR) or spectrogram L1
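
A minimal sketch of the SI-SNR objective mentioned above (assuming time-domain estimates and targets of equal length); training typically minimizes the negative SI-SNR.

import torch

def si_snr(estimate, target, eps=1e-8):
    """estimate, target: (B, T) waveforms. Returns SI-SNR in dB per example."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)   # zero-mean both signals
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target (removes scale differences)
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    s_target = dot * target / (torch.sum(target ** 2, dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

def si_snr_loss(estimate, target):
    return -si_snr(estimate, target).mean()   # negate: higher SI-SNR is better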

5. Key Research Methods

From the pioneering Looking to Listen to self-supervised AV-HuBERT.

Looking to Listen
Google (Ephrat et al.) (2018)
Audio-visual

First large-scale AV speech separation. Uses face embeddings + dilated convolutions.

Architecture: CNN face encoder + audio spectrogram network + fusion
Dataset: AVSpeech (2800+ hours)
VisualVoice
UT Austin / Facebook AI (Gao & Grauman) (2021)
Cross-modal

Uses lip reading features + speaker identity. Better generalization to unseen speakers.

Architecture: Lip reading encoder + speaker embedding + U-Net separator
Dataset: VoxCeleb2, LRS2
AV-HuBERT
Meta (Shi et al.) (2022)
Self-supervised

Learns AV representations from unlabeled video. State-of-the-art on lip reading.

Architecture: Transformer with audio + video streams, masked prediction
Dataset: LRS3, VoxCeleb2
Audio-Visual HuBERT
Meta (Hsu et al.) (2023)
Foundation model

Extends HuBERT to joint AV learning. Unified representations for multiple tasks.

Architecture: Shared transformer backbone, modality-specific heads
Dataset: LRS3, AVSpeech, VoxCeleb

Benchmark Datasets

Dataset | Size | Content | Use Case
LRS2 | 225 hours | BBC news, lectures | Lip reading, AV ASR
LRS3 | 438 hours | TED talks | Large-scale AV training
VoxCeleb2 | 2000+ hours | Celebrity interviews | Speaker recognition, separation
AVSpeech | 2800+ hours | YouTube videos | Looking to Listen training

State-of-the-Art Results (SDRi on LRS2)

Audio-only: ~8 dB
Looking to Listen: ~10.8 dB
VisualVoice: ~12.1 dB
AV-HuBERT: ~14.3 dB

SDRi = Signal-to-Distortion Ratio improvement over the mixture. Higher is better. Visual information adds 2-6 dB over audio-only methods.
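
A simplified way to compute an SDR improvement for a single reference, assuming a plain SNR-style SDR rather than the full BSS Eval procedure that published numbers use.

import torch

def sdr(estimate, reference, eps=1e-8):
    # Ratio of reference energy to residual error energy, in dB.
    num = torch.sum(reference ** 2, dim=-1)
    den = torch.sum((reference - estimate) ** 2, dim=-1) + eps
    return 10 * torch.log10(num / den + eps)

def sdr_improvement(estimate, reference, mixture):
    """How much the separated signal improves over just keeping the mixture."""
    return sdr(estimate, reference) - sdr(mixture, reference)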

6. Code Examples

Implementation snippets for lip detection, AV separation models, and evaluation.

Audio-Visual Separator Model (PyTorch)
Install: pip install torch torchaudio
import torch
import torch.nn as nn
import torchaudio

class AudioVisualSeparator(nn.Module):
    """
    Simplified Looking to Listen architecture.

    The model takes:
    - Mixed audio spectrogram: (B, 1, F, T)
    - Face embeddings for each speaker: (B, N_speakers, D_face)

    Outputs:
    - Separated spectrograms for each speaker: (B, N_speakers, F, T)
    """

    def __init__(self, n_speakers=2, face_dim=512, audio_channels=257):
        super().__init__()

        # Face encoder (pretrained, e.g., FaceNet)
        self.face_encoder = nn.Sequential(
            nn.Linear(face_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256)
        )

        # Audio encoder with dilated convolutions
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=4, dilation=4),
            nn.BatchNorm2d(128),
            nn.ReLU()
        )

        # Audio-visual fusion
        self.fusion = nn.Sequential(
            nn.Conv2d(128 + 256, 256, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(256, 128, kernel_size=1),
            nn.ReLU()
        )

        # Mask prediction (one per speaker)
        self.mask_predictor = nn.Sequential(
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, n_speakers, kernel_size=1),
            nn.Sigmoid()
        )

    def forward(self, mixed_audio, face_embeddings):
        """
        Args:
            mixed_audio: (B, 1, F, T) - mixed spectrogram
            face_embeddings: (B, N, D) - face embedding per speaker

        Returns:
            masks: (B, N, F, T) - separation mask per speaker
        """
        B, _, F, T = mixed_audio.shape
        N = face_embeddings.shape[1]

        # Encode audio
        audio_features = self.audio_encoder(mixed_audio)  # (B, 128, F, T)

        # Encode faces and tile across time/frequency
        face_features = self.face_encoder(face_embeddings)  # (B, N, 256)

        # For simplicity, use first speaker's face for conditioning
        # Full model would predict separate masks for each speaker
        face_tiled = face_features[:, 0, :].unsqueeze(-1).unsqueeze(-1)
        face_tiled = face_tiled.expand(-1, -1, F, T)  # (B, 256, F, T)

        # Fuse modalities
        fused = torch.cat([audio_features, face_tiled], dim=1)
        fused = self.fusion(fused)

        # Predict masks
        masks = self.mask_predictor(fused)  # (B, N, F, T)

        return masks

    def separate(self, mixed_audio, face_embeddings):
        """Apply masks to separate audio."""
        masks = self.forward(mixed_audio, face_embeddings)

        # Apply masks to mixed spectrogram
        separated = mixed_audio * masks  # (B, N, F, T)

        return separated, masks
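
A quick shape check of the sketch above with random tensors (the sizes are illustrative):

if __name__ == "__main__":
    model = AudioVisualSeparator(n_speakers=2, face_dim=512, audio_channels=257)
    mixed = torch.rand(4, 1, 257, 100)    # (B, 1, F, T) magnitude spectrogram
    faces = torch.randn(4, 2, 512)        # (B, N_speakers, D_face)
    separated, masks = model.separate(mixed, faces)
    print(separated.shape, masks.shape)   # both torch.Size([4, 2, 257, 100])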

Common Implementation Pitfalls

Watch Out For:
  • AV sync: 25 fps video and 16 kHz audio must be aligned (see the alignment sketch below)
  • Face detection failures in low light
  • Lip crop jitter between frames
  • Missing face = fall back to audio-only
Best Practices:
  • Use face tracking, not per-frame detection
  • Normalize lip crop position across frames
  • Pretrain the face encoder on a lip reading task
  • Data augmentation: noise, reverberation
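
A minimal alignment sketch for the AV sync pitfall above, assuming 25 fps visual features and a 10 ms audio hop (100 audio frames per second, i.e. 4 per video frame):

import torch

def align_video_to_audio(video_feats, audio_frames_per_video_frame=4):
    """video_feats: (T_video, D). Repeat each visual frame so the sequence
    lines up with the audio frame rate -> (T_video * 4, D)."""
    return torch.repeat_interleave(video_feats, audio_frames_per_video_frame, dim=0)

def trim_to_match(video_feats, audio_feats):
    """Crop both streams to the shorter length after upsampling the video."""
    v = align_video_to_audio(video_feats)
    t = min(v.shape[0], audio_feats.shape[0])
    return v[:t], audio_feats[:t]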

Quick Reference

The Core Idea
  • Lip movements reveal what is being spoken
  • Visual cues disambiguate mixed audio
  • Visual conditioning resolves the permutation problem
Key Methods
  • Looking to Listen (Google, 2018)
  • VisualVoice (UT Austin / Facebook AI, 2021)
  • AV-HuBERT (Meta, 2022)
Metrics
  • SDRi: separation quality
  • PESQ: perceptual quality
  • STOI: intelligibility

Use Cases

  • Meeting transcription cleanup
  • Broadcast captioning
  • Noisy lecture enhancement
  • Courtroom/boardroom audio

Architectural Patterns

Audio-Visual Masking

Use lip regions to condition separation masks on the audio spectrogram.

Speaker Localization + Separation

Track faces, map to speakers, then isolate corresponding audio streams.

Implementations

Open Source

AV-Separation (MS3)

MIT
Open Source

Audio-visual speech separation reference implementation.

SpeechSplit + Visual Conditioning

Apache 2.0
Open Source

Pair SepFormer with face tracks for better diarization.

NeMo AV-Diarization

Apache 2.0
Open Source

Pipelines that fuse VAD, diarization, and separation with vision cues.

Benchmarks

Quick Facts

Input: Video
Output: Audio
Implementations: 3 open source, 0 API
Patterns: 2 approaches

Have benchmark data?

Help us track the state of the art for audio-visual speech separation.

Submit Results