Speech Emotion Recognition
Classify speaker emotion or affective state from voice.
How Speech Emotion Recognition Works
A technical deep-dive into Speech Emotion Recognition. How machines learn to hear not just what we say, but how we feel when we say it.
The Problem
Why words alone are not enough.
Consider the phrase: "That's great."
Spoken with rising pitch and high energy, it means genuine enthusiasm. With flat pitch and a sigh, it drips with sarcasm. With trembling voice, it might mask disappointment. The same words carry completely different meanings based on how they are spoken.
Speech Emotion Recognition (SER) extracts these paralinguistic cues - the pitch, rhythm, intensity, and voice quality that reveal our emotional state. This goes beyond speech-to-text; it is about understanding the music behind the words.
- Call centers: detect frustrated callers in real time, route them to specialists before escalation, and measure the emotional journey across interactions.
- Healthcare: screen for depression markers in voice, track mood over time, and alert caregivers to emotional changes.
- Voice assistants: make assistants emotionally aware, adapt responses based on user state, and create more empathetic AI companions.
Acoustic Features
The building blocks of emotion in speech. What the model actually "hears".
- Pitch (F0): the fundamental frequency of the voice. Higher when excited or angry, lower when sad.
- Energy: the loudness of speech. Correlates strongly with arousal and emotional intensity.
- Tempo and rhythm: speaking rate and pause patterns. Emotions affect speech rhythm significantly.
- Spectral features: frequency distribution characteristics. Capture voice quality and timbre.
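A minimal sketch of extracting these four feature families with librosa. The file name `speech.wav` and the specific choices (pyin for F0, RMS for energy, onset rate as a tempo proxy, spectral centroid and MFCCs for timbre) are illustrative assumptions, not a canonical recipe.

```python
import librosa
import numpy as np

# Load audio at 16 kHz (a common rate for speech models)
y, sr = librosa.load("speech.wav", sr=16000)

# Pitch (F0): fundamental frequency contour via probabilistic YIN
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
mean_f0 = np.nanmean(f0)                                  # average pitch over voiced frames

# Energy: root-mean-square loudness per frame
rms = librosa.feature.rms(y=y)[0]

# Tempo / rhythm: onset rate as a rough proxy for speaking rate
onsets = librosa.onset.onset_detect(y=y, sr=sr)
onset_rate = len(onsets) / (len(y) / sr)

# Spectral features: centroid (brightness) and MFCCs (timbre)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(f"mean F0: {mean_f0:.1f} Hz, mean RMS: {rms.mean():.4f}")
print(f"onsets/sec: {onset_rate:.2f}, mean centroid: {centroid.mean():.0f} Hz")
print(f"MFCC shape: {mfcc.shape}")
```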
Traditional vs Modern Feature Extraction
Traditional (hand-crafted features such as pitch, energy, and MFCC statistics):
- Pros: interpretable, fast, works with small data
- Cons: requires domain expertise, less accurate
Modern (learned deep representations):
- Pros: higher accuracy, learns rich representations
- Cons: black-box, requires GPU, more data hungry
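To make the contrast concrete, here is a rough side-by-side sketch: a small hand-crafted feature vector (reusing the statistics above) versus an utterance embedding from a self-supervised wav2vec 2.0 backbone. The `facebook/wav2vec2-base` checkpoint and mean-pooling over frames are illustrative assumptions.

```python
import librosa
import numpy as np
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

y, sr = librosa.load("speech.wav", sr=16000)

# --- Traditional: a small hand-crafted feature vector ---
f0, _, _ = librosa.pyin(y, fmin=50, fmax=500, sr=sr)
handcrafted = np.array([
    np.nanmean(f0), np.nanstd(f0),                         # pitch statistics
    librosa.feature.rms(y=y).mean(),                       # energy
    librosa.feature.spectral_centroid(y=y, sr=sr).mean(),  # timbre
])
print("hand-crafted features:", handcrafted.shape)         # (4,)

# --- Modern: learned representation from wav2vec 2.0 ---
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
inputs = extractor(y, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state             # (1, frames, 768)
embedding = hidden.mean(dim=1)                              # utterance-level embedding
print("learned embedding:", embedding.shape)                # (1, 768)
```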
The Arousal-Valence Model
Beyond discrete labels: representing emotions as continuous dimensions.
Categorical labels like "happy" or "angry" are intuitive but limiting. The circumplex model places emotions on two continuous axes:
- Arousal: how energized or activated the emotional state is.
- Valence: how positive or negative the emotional state is.
Continuous prediction allows capturing subtle emotional nuances and transitions.
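As a toy illustration of how continuous predictions relate back to familiar labels, the function below maps a predicted (valence, arousal) pair in [-1, 1] to the quadrant of the circumplex it falls in. The quadrant names are the conventional ones; the thresholds are an assumption for illustration, not part of any particular model.

```python
def circumplex_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) point in [-1, 1]^2 to a coarse emotion quadrant."""
    if abs(valence) < 0.1 and abs(arousal) < 0.1:
        return "neutral"                      # near the origin: low-intensity state
    if valence >= 0 and arousal >= 0:
        return "happy / excited"              # positive, high energy
    if valence < 0 and arousal >= 0:
        return "angry / anxious"              # negative, high energy
    if valence < 0 and arousal < 0:
        return "sad / bored"                  # negative, low energy
    return "calm / content"                   # positive, low energy

print(circumplex_quadrant(valence=0.7, arousal=0.6))    # happy / excited
print(circumplex_quadrant(valence=-0.5, arousal=0.8))   # angry / anxious
print(circumplex_quadrant(valence=-0.4, arousal=-0.6))  # sad / bored
```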
Models and Methods
From self-supervised transformers to specialized emotion models.
| Model | Type | Architecture | Accuracy | Notes |
|---|---|---|---|---|
| wav2vec2-emotion | Self-supervised | wav2vec2 + classification head | ~75% (4-class) | Fine-tuned wav2vec2-base, good baseline |
| HuBERT-emotion | Self-supervised | HuBERT + pooling + classifier | ~78% (4-class) | Better representations than wav2vec2 |
| emotion2vec | Specialized | Self-supervised pretraining on emotion | ~80% (4-class) | SOTA open-source, Alibaba DAMO |
| SpeechBrain | Toolkit | ECAPA-TDNN, wav2vec2 recipes | ~76% (4-class) | Production-ready, excellent docs |
| Hume AI | API | Proprietary multimodal | 48 emotions + dimensions | Most granular, includes prosody |
| OpenAI Whisper + LLM | Pipeline | ASR -> text -> emotion via LLM | Good for text emotions | Loses acoustic info, text-only analysis |
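The last row describes a cascade rather than a single model: transcribe first, then classify the transcript. Below is a minimal sketch of that idea; the `openai/whisper-base` checkpoint and the `j-hartmann/emotion-english-distilroberta-base` text classifier are illustrative choices, not a prescribed pairing.

```python
from transformers import pipeline

# Step 1: ASR -- transcribe speech to text with Whisper
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
transcript = asr("speech.wav")["text"]

# Step 2: classify emotion from the transcript alone with a text model
text_emotion = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    top_k=3,
)
print(transcript)
print(text_emotion(transcript))

# Caveat: this cascade only sees the words. A flat, sarcastic "That's great."
# and an enthusiastic one yield the same transcript, so the acoustic
# information that SER is really about is lost.
```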
- wav2vec2 / HuBERT: self-supervised models pretrained on massive amounts of unlabeled speech. They learn rich representations that transfer well to emotion recognition with minimal fine-tuning.
- emotion2vec: purpose-built for emotion, pretrained with emotion-aware objectives on diverse emotion datasets. Current open-source SOTA.
- SpeechBrain: a complete toolkit with pretrained recipes, including data loaders, training loops, and evaluation metrics. Excellent documentation.
- Hume AI: a commercial API with 48+ fine-grained emotion categories, including prosody analysis and multimodal support (face + voice).
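A sketch of the "minimal fine-tuning" route: a HuBERT backbone with a fresh classification head, trained with a standard cross-entropy objective. The checkpoint name `facebook/hubert-base-ls960`, the 4-class label set, and the dummy audio are assumptions for illustration; a real run would iterate over a labeled dataset such as IEMOCAP.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, AutoModelForAudioClassification

labels = ["angry", "happy", "neutral", "sad"]        # common 4-class setup

# HuBERT backbone (assumed checkpoint) + randomly initialized classifier head
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/hubert-base-ls960",
    num_labels=len(labels),
    label2id={l: i for i, l in enumerate(labels)},
    id2label=dict(enumerate(labels)),
)
extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                     padding_value=0.0, do_normalize=True)

# One illustrative training step on fake audio; in practice wrap this in a
# DataLoader or the HF Trainer over a labeled emotion dataset.
waveform = torch.randn(16000 * 3).numpy()            # 3 s of dummy 16 kHz audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
target = torch.tensor([labels.index("happy")])

outputs = model(**inputs, labels=target)             # loss computed internally
outputs.loss.backward()                              # fine-tunes backbone + head
print(f"loss: {outputs.loss.item():.3f}")
```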
Standard Datasets
| Dataset | Size | Emotions | Type | Notes |
|---|---|---|---|---|
| IEMOCAP | 12 hours | 9 | Acted + Improv | Most cited, English |
| RAVDESS | 7356 clips | 8 | Acted | North American, balanced |
| CREMA-D | 7442 clips | 6 | Acted | Diverse actors, video+audio |
| MSP-IMPROV | 8.4 hours | 4 | Improvised | Natural interactions |
| CMU-MOSEI | 65 hours | 6 | In-the-wild | YouTube, multimodal |
| EmoV-DB | 7000 clips | 5 | Acted | Multi-language |
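RAVDESS is a convenient starting point because each file name encodes its own label: the third hyphen-separated field is an emotion code. A minimal parser, assuming the standard RAVDESS naming scheme (the example path is made up):

```python
from pathlib import Path

# RAVDESS file names have 7 hyphen-separated fields, e.g. "03-01-05-01-02-01-12.wav";
# the third field is the emotion code.
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_label(path: str) -> str:
    """Return the emotion label encoded in a RAVDESS file name."""
    emotion_code = Path(path).stem.split("-")[2]
    return RAVDESS_EMOTIONS[emotion_code]

print(ravdess_label("Actor_12/03-01-05-01-02-01-12.wav"))  # angry
```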
Code Examples
Get started with speech emotion recognition in Python.
```python
from transformers import pipeline
import librosa

# Load pre-trained emotion recognition pipeline
classifier = pipeline(
    "audio-classification",
    model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)

# Load audio (16 kHz required for wav2vec2)
audio, sr = librosa.load("speech.wav", sr=16000)

# Classify emotion
result = classifier(audio)
for pred in result[:3]:
    print(f"{pred['label']:12} {pred['score']:.3f}")

# Output:
# angry        0.743
# sad          0.142
# neutral      0.089
```
Quick Reference
Best accuracy:
- emotion2vec (SOTA open-source)
- Fine-tuned HuBERT
- Ensemble with text analysis

Quick start:
- HuggingFace pipeline
- SpeechBrain pretrained
- Hume AI (no ML needed)

Key insights:
- Pitch + energy + tempo = emotion
- Arousal-valence captures nuance
- Acted data differs from real
Use Cases
- ✓ Call center quality
- ✓ Health monitoring
- ✓ Gaming NPCs
- ✓ Voice analytics
Architectural Patterns
Spectrogram CNN/Transformer
Predict emotion from mel features.
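A minimal sketch of this pattern in PyTorch, assuming 4 emotion classes and log-mel inputs of shape (batch, 1, n_mels, frames); the layer sizes are illustrative, not a reference architecture.

```python
import torch
import torch.nn as nn

class MelEmotionCNN(nn.Module):
    """Small CNN over log-mel spectrograms -> emotion logits."""

    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse time/frequency dims
        self.head = nn.Linear(64, n_classes)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        x = self.conv(mel)                    # (B, 64, n_mels/4, frames/4)
        x = self.pool(x).flatten(1)           # (B, 64)
        return self.head(x)                   # (B, n_classes) emotion logits

# Example: batch of 8 clips, 64 mel bands, 300 frames
logits = MelEmotionCNN()(torch.randn(8, 1, 64, 300))
print(logits.shape)                           # torch.Size([8, 4])
```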
SSL Audio Fine-Tune
Fine-tune wav2vec2/HuBERT embeddings for emotion.
Quick Facts
- Input: Audio
- Output: Structured data
- Implementations: 3 open source, 0 API
- Patterns: 2 approaches