Speech Emotion Recognition
Classify speaker emotion or affective state from voice.
How Speech Emotion Recognition Works
A technical deep-dive into Speech Emotion Recognition. How machines learn to hear not just what we say, but how we feel when we say it.
The Problem
Why words alone are not enough.
Consider the phrase: "That's great."
Spoken with rising pitch and high energy, it means genuine enthusiasm. With flat pitch and a sigh, it drips with sarcasm. With trembling voice, it might mask disappointment. The same words carry completely different meanings based on how they are spoken.
Speech Emotion Recognition (SER) extracts these paralinguistic cues - the pitch, rhythm, intensity, and voice quality that reveal our emotional state. This goes beyond speech-to-text; it is about understanding the music behind the words.
- Call centers: detect frustrated callers in real time, route them to specialists before escalation, and measure the emotional journey across interactions.
- Healthcare: screen for depression markers in voice, track mood over time, and alert caregivers to emotional changes.
- Voice assistants: make assistants emotionally aware, adapt responses based on user state, and create more empathetic AI companions.
Acoustic Features
The building blocks of emotion in speech. What the model actually "hears".
- Pitch (F0): the fundamental frequency of the voice. Higher when excited or angry, lower when sad.
- Energy: the loudness of speech. Correlates strongly with arousal and emotional intensity.
- Tempo and rhythm: speaking rate and pause patterns. Emotions affect speech rhythm significantly.
- Spectral features: frequency distribution characteristics. Capture voice quality and timbre.
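A minimal sketch of extracting these four feature families with librosa. The file name `speech.wav` and the specific choices (pyin for F0, RMS for energy, onset rate as a tempo proxy, spectral centroid and MFCCs for timbre) are illustrative assumptions, not a canonical recipe.

```python
import librosa
import numpy as np

# Load audio at 16 kHz (a common rate for speech models)
y, sr = librosa.load("speech.wav", sr=16000)

# Pitch (F0): fundamental frequency contour via probabilistic YIN
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
mean_f0 = np.nanmean(f0)                                  # average pitch over voiced frames

# Energy: root-mean-square loudness per frame
rms = librosa.feature.rms(y=y)[0]

# Tempo / rhythm: onset rate as a rough proxy for speaking rate
onsets = librosa.onset.onset_detect(y=y, sr=sr)
onset_rate = len(onsets) / (len(y) / sr)

# Spectral features: centroid (brightness) and MFCCs (timbre)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(f"mean F0: {mean_f0:.1f} Hz, mean RMS: {rms.mean():.4f}")
print(f"onsets/sec: {onset_rate:.2f}, mean centroid: {centroid.mean():.0f} Hz")
print(f"MFCC shape: {mfcc.shape}")
```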
Traditional vs Modern Feature Extraction
Traditional (hand-crafted features such as pitch, energy, and MFCC statistics):
- Pros: interpretable, fast, works with small data
- Cons: requires domain expertise, less accurate
Modern (learned deep representations):
- Pros: higher accuracy, learns rich representations
- Cons: black-box, requires GPU, more data hungry
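To make the contrast concrete, here is a rough side-by-side sketch: a small hand-crafted feature vector (reusing the statistics above) versus an utterance embedding from a self-supervised wav2vec 2.0 backbone. The `facebook/wav2vec2-base` checkpoint and mean-pooling over frames are illustrative assumptions.

```python
import librosa
import numpy as np
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

y, sr = librosa.load("speech.wav", sr=16000)

# --- Traditional: a small hand-crafted feature vector ---
f0, _, _ = librosa.pyin(y, fmin=50, fmax=500, sr=sr)
handcrafted = np.array([
    np.nanmean(f0), np.nanstd(f0),                         # pitch statistics
    librosa.feature.rms(y=y).mean(),                       # energy
    librosa.feature.spectral_centroid(y=y, sr=sr).mean(),  # timbre
])
print("hand-crafted features:", handcrafted.shape)         # (4,)

# --- Modern: learned representation from wav2vec 2.0 ---
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
inputs = extractor(y, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state             # (1, frames, 768)
embedding = hidden.mean(dim=1)                              # utterance-level embedding
print("learned embedding:", embedding.shape)                # (1, 768)
```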
The Arousal-Valence Model
Beyond discrete labels: representing emotions as continuous dimensions.
Categorical labels like "happy" or "angry" are intuitive but limiting. The circumplex model places emotions on two continuous axes:
- Arousal: how energized or activated the emotional state is.
- Valence: how positive or negative the emotional state is.
Continuous prediction allows capturing subtle emotional nuances and transitions.
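As a toy illustration of how continuous predictions relate back to familiar labels, the function below maps a predicted (valence, arousal) pair in [-1, 1] to the quadrant of the circumplex it falls in. The quadrant names are the conventional ones; the thresholds are an assumption for illustration, not part of any particular model.

```python
def circumplex_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) point in [-1, 1]^2 to a coarse emotion quadrant."""
    if abs(valence) < 0.1 and abs(arousal) < 0.1:
        return "neutral"                      # near the origin: low-intensity state
    if valence >= 0 and arousal >= 0:
        return "happy / excited"              # positive, high energy
    if valence < 0 and arousal >= 0:
        return "angry / anxious"              # negative, high energy
    if valence < 0 and arousal < 0:
        return "sad / bored"                  # negative, low energy
    return "calm / content"                   # positive, low energy

print(circumplex_quadrant(valence=0.7, arousal=0.6))    # happy / excited
print(circumplex_quadrant(valence=-0.5, arousal=0.8))   # angry / anxious
print(circumplex_quadrant(valence=-0.4, arousal=-0.6))  # sad / bored
```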
Models and Methods
From self-supervised transformers to specialized emotion models.
| Model | Type | Architecture | Accuracy | Notes |
|---|---|---|---|---|
| wav2vec2-emotion | Self-supervised | wav2vec2 + classification head | ~75% (4-class) | Fine-tuned wav2vec2-base, good baseline |
| HuBERT-emotion | Self-supervised | HuBERT + pooling + classifier | ~78% (4-class) | Better representations than wav2vec2 |
| emotion2vec | Specialized | Self-supervised pretraining on emotion | ~80% (4-class) | SOTA open-source, Alibaba DAMO |
| SpeechBrain | Toolkit | ECAPA-TDNN, wav2vec2 recipes | ~76% (4-class) | Production-ready, excellent docs |
| Hume AI | API | Proprietary multimodal | 48 emotions + dimensions | Most granular, includes prosody |
| OpenAI Whisper + LLM | Pipeline | ASR -> text -> emotion via LLM | Good for text emotions | Loses acoustic info, text-only analysis |
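The last row describes a cascade rather than a single model: transcribe first, then classify the transcript. Below is a minimal sketch of that idea; the `openai/whisper-base` checkpoint and the `j-hartmann/emotion-english-distilroberta-base` text classifier are illustrative choices, not a prescribed pairing.

```python
from transformers import pipeline

# Step 1: ASR -- transcribe speech to text with Whisper
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
transcript = asr("speech.wav")["text"]

# Step 2: classify emotion from the transcript alone with a text model
text_emotion = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    top_k=3,
)
print(transcript)
print(text_emotion(transcript))

# Caveat: this cascade only sees the words. A flat, sarcastic "That's great."
# and an enthusiastic one yield the same transcript, so the acoustic
# information that SER is really about is lost.
```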
- wav2vec2 / HuBERT: self-supervised models pretrained on massive amounts of unlabeled speech. They learn rich representations that transfer well to emotion recognition with minimal fine-tuning.
- emotion2vec: purpose-built for emotion, pretrained with emotion-aware objectives on diverse emotion datasets. Current open-source SOTA.
- SpeechBrain: a complete toolkit with pretrained recipes, including data loaders, training loops, and evaluation metrics. Excellent documentation.
- Hume AI: a commercial API with 48+ fine-grained emotion categories, including prosody analysis and multimodal support (face + voice).
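A sketch of the "minimal fine-tuning" route: a HuBERT backbone with a fresh classification head, trained with a standard cross-entropy objective. The checkpoint name `facebook/hubert-base-ls960`, the 4-class label set, and the dummy audio are assumptions for illustration; a real run would iterate over a labeled dataset such as IEMOCAP.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, AutoModelForAudioClassification

labels = ["angry", "happy", "neutral", "sad"]        # common 4-class setup

# HuBERT backbone (assumed checkpoint) + randomly initialized classifier head
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/hubert-base-ls960",
    num_labels=len(labels),
    label2id={l: i for i, l in enumerate(labels)},
    id2label=dict(enumerate(labels)),
)
extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                     padding_value=0.0, do_normalize=True)

# One illustrative training step on fake audio; in practice wrap this in a
# DataLoader or the HF Trainer over a labeled emotion dataset.
waveform = torch.randn(16000 * 3).numpy()            # 3 s of dummy 16 kHz audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
target = torch.tensor([labels.index("happy")])

outputs = model(**inputs, labels=target)             # loss computed internally
outputs.loss.backward()                              # fine-tunes backbone + head
print(f"loss: {outputs.loss.item():.3f}")
```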
Standard Datasets
| Dataset | Size | Emotions | Type | Notes |
|---|---|---|---|---|
| IEMOCAP | 12 hours | 9 | Acted + Improv | Most cited, English |
| RAVDESS | 7356 clips | 8 | Acted | North American, balanced |
| CREMA-D | 7442 clips | 6 | Acted | Diverse actors, video+audio |
| MSP-IMPROV | 8.4 hours | 4 | Improvised | Natural interactions |
| CMU-MOSEI | 65 hours | 6 | In-the-wild | YouTube, multimodal |
| EmoV-DB | 7000 clips | 5 | Acted | Multi-language |
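RAVDESS is a convenient starting point because each file name encodes its own label: the third hyphen-separated field is an emotion code. A minimal parser, assuming the standard RAVDESS naming scheme (the example path is made up):

```python
from pathlib import Path

# RAVDESS file names have 7 hyphen-separated fields, e.g. "03-01-05-01-02-01-12.wav";
# the third field is the emotion code.
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_label(path: str) -> str:
    """Return the emotion label encoded in a RAVDESS file name."""
    emotion_code = Path(path).stem.split("-")[2]
    return RAVDESS_EMOTIONS[emotion_code]

print(ravdess_label("Actor_12/03-01-05-01-02-01-12.wav"))  # angry
```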
Code Examples
Get started with speech emotion recognition in Python.
```python
from transformers import pipeline
import librosa

# Load pre-trained emotion recognition pipeline
classifier = pipeline(
    "audio-classification",
    model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)

# Load audio (16 kHz required for wav2vec2)
audio, sr = librosa.load("speech.wav", sr=16000)

# Classify emotion
result = classifier(audio)
for pred in result[:3]:
    print(f"{pred['label']:12} {pred['score']:.3f}")

# Output:
# angry        0.743
# sad          0.142
# neutral      0.089
```
Quick Reference
Best accuracy:
- emotion2vec (SOTA open-source)
- Fine-tuned HuBERT
- Ensemble with text analysis

Quick start:
- HuggingFace pipeline
- SpeechBrain pretrained
- Hume AI (no ML needed)

Key insights:
- Pitch + energy + tempo = emotion
- Arousal-valence captures nuance
- Acted data differs from real
Use Cases
- ✓ Call center quality
- ✓ Health monitoring
- ✓ Gaming NPCs
- ✓ Voice analytics
Architectural Patterns
Spectrogram CNN/Transformer
Predict emotion from mel features.
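A minimal sketch of this pattern in PyTorch, assuming 4 emotion classes and log-mel inputs of shape (batch, 1, n_mels, frames); the layer sizes are illustrative, not a reference architecture.

```python
import torch
import torch.nn as nn

class MelEmotionCNN(nn.Module):
    """Small CNN over log-mel spectrograms -> emotion logits."""

    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse time/frequency dims
        self.head = nn.Linear(64, n_classes)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        x = self.conv(mel)                    # (B, 64, n_mels/4, frames/4)
        x = self.pool(x).flatten(1)           # (B, 64)
        return self.head(x)                   # (B, n_classes) emotion logits

# Example: batch of 8 clips, 64 mel bands, 300 frames
logits = MelEmotionCNN()(torch.randn(8, 1, 64, 300))
print(logits.shape)                           # torch.Size([8, 4])
```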
SSL Audio Fine-Tune
Fine-tune wav2vec2/HuBERT embeddings for emotion.
Quick Facts
- Input: Audio
- Output: Structured data
- Implementations: 3 open source, 0 API
- Patterns: 2 approaches