
Keyword Spotting

Detect wake words and short commands with low latency and tiny footprints.

How Keyword Spotting Works

A technical deep-dive into wake word detection. From the power constraints that shaped the field to the streaming architectures that make "Hey Siri" feel instantaneous.

1

The Problem: Always Listening, Never Draining

You want your device to respond the instant you say its name. But running full speech recognition 24/7 would drain the battery in hours. The solution is two-stage detection: a tiny, always-on "spotter" waits for just your wake word, then hands off to the heavy ASR engine.

Interactive Demo: Keyword Detection in Action

[Demo: detecting "Hey Jarvis" with a live detection confidence meter]

Watch how the model's confidence spikes only during the keyword region. The smoothing prevents false triggers from momentary high scores.

Popular Wake Words and Their Power Budgets

Wake Word | Vendor | Power Budget
"Hey Siri" | Apple | ~1mW
"Alexa" | Amazon | ~1.5mW
"OK Google" | Google | ~2mW
"Hey Cortana" | Microsoft | ~1.5mW
"Computer" | Star Trek | Custom

The Two-Stage Architecture

Stage 1: KWS (always on; 1-5 mW; ~50KB model) -> Wake Event (triggers ASR) -> Stage 2: ASR (on-demand; 500+ mW; ~100MB model)

The KWS model runs continuously on a low-power DSP or neural accelerator. The main CPU and ASR engine only wake up when the keyword is detected.
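
A minimal sketch of the gating logic, assuming hypothetical get_audio_chunk(), kws_detect(), and run_full_asr() helpers; real systems delegate this loop to the platform's low-power audio subsystem.

def two_stage_loop(get_audio_chunk, kws_detect, run_full_asr):
    """Always-on spotter gates the expensive ASR engine (all helpers are hypothetical)."""
    while True:
        chunk = get_audio_chunk()        # blocking read from the always-on microphone
        if kws_detect(chunk):            # tiny KWS model on the DSP/NPU, milliwatt-scale
            transcript = run_full_asr()  # wake the main CPU and ASR only after detection
            print("Heard:", transcript)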

2

The Constraints That Shape Everything

Keyword spotting operates under severe constraints that full ASR systems never face. Every design decision balances power, latency, accuracy, and model size.

Power Budget

Must run on <5mW to enable months of battery life.

Target: <5mW (typical ASR uses 500mW+)

Latency

Detection must feel instant (<200ms from utterance end).

Target: <200ms (users expect an immediate response)

Accuracy

High recall (don't miss wake words) with low false accepts.

Target: >95% recall, <1 false accept per day

Model Size

Must fit in <100KB for embedded deployment.

Target: <100KB (Whisper tiny is 39MB)

Power Budget Reality Check

System | Power | Battery Life
Full Whisper (running continuously) | ~500mW (GPU) or ~100mW (NPU) | ~2-4 hours
On-device ASR (optimized for mobile) | ~50mW | ~8-12 hours
KWS on DSP (always-on detector) | ~1-5mW | ~months

The Accuracy Trade-off: Recall vs False Accepts

High Recall (Don't Miss)

Users get frustrated if they have to repeat the wake word. Target: >95% detection rate even with background noise, varied accents, and different speaking styles.

Recall = TP / (TP + FN)
Low False Accepts (Don't Annoy)

Nothing's worse than your device randomly activating. Target: <1 false activation per day during normal conversation and media playback.

False Accept Rate = FP / Total Negatives

The sensitivity parameter lets users trade off between these. Higher sensitivity catches more true activations but also more false ones. Most systems default to ~0.5.
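
A toy sketch of that trade-off: sweeping a detection threshold over made-up scores and labels and computing the two metrics defined above.

import numpy as np

# Made-up per-utterance detector scores and ground-truth labels (1 = wake word present)
scores = np.array([0.91, 0.72, 0.40, 0.88, 0.15, 0.65, 0.08, 0.55])
labels = np.array([1,    1,    1,    1,    0,    0,    0,    0])

for threshold in (0.3, 0.5, 0.7):
    predicted = scores >= threshold
    tp = np.sum(predicted & (labels == 1))
    fn = np.sum(~predicted & (labels == 1))
    fp = np.sum(predicted & (labels == 0))
    recall = tp / (tp + fn)             # Recall = TP / (TP + FN)
    far = fp / np.sum(labels == 0)      # False Accept Rate = FP / Total Negatives
    print(f"threshold={threshold:.1f}  recall={recall:.2f}  false_accept_rate={far:.2f}")

Raising the threshold (lower sensitivity) drops the false accept rate but also misses more true activations.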

3

Feature Extraction: MFCC and Beyond

Raw audio at 16kHz has 16,000 samples per second. We need a compact representation that captures what matters for keyword recognition while being cheap to compute.

The MFCC Pipeline

MFCCs have been the workhorse of speech processing for decades. They compress audio into ~13 numbers per frame while preserving the information that distinguishes phonemes.

Audio (16kHz samples; 1 sec = 16,000 values) -> Frame (25ms windows; 10ms hop = 100 frames/sec) -> FFT (power spectrum; 512 frequency bins) -> Mel Filter (26 filterbanks; matches human hearing) -> Log (compress dynamics; dB scale) -> DCT (decorrelate; keep first 13 coefficients)
Result: 1 second of audio (16,000 values) -> 100 frames x 13 MFCCs = 1,300 values
12x compression with minimal information loss
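
A minimal sketch of the same pipeline using librosa (our choice of library, not mandated by the text), with 25ms windows, a 10ms hop, 26 mel filters, and 13 coefficients; the input here is placeholder noise standing in for real audio.

import numpy as np
import librosa

SR = 16000                                    # 16kHz sample rate
y = np.random.randn(SR).astype(np.float32)    # 1 second of placeholder audio

mfcc = librosa.feature.mfcc(
    y=y, sr=SR,
    n_mfcc=13,        # keep the first 13 DCT coefficients
    n_fft=400,        # 25ms window at 16kHz
    hop_length=160,   # 10ms hop -> ~100 frames/sec
    n_mels=26,        # 26 mel filterbanks
)
print(mfcc.shape)     # (13, ~101): 13 coefficients per frame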

Feature Extraction Methods

MFCC (Mel-Frequency Cepstral Coefficients)
13-40 per frame | Compute: Low

The classic choice for keyword spotting. Compact representation (13-40 coefficients) that captures vocal tract shape while being robust to volume changes.

+ Very compact
+ Well understood
+ CPU efficient
- Loses some fine detail
- Fixed window size

Log-Mel (Log Mel-Filterbank Energies)
40-80 per frame | Compute: Low

Direct log energies from mel filterbanks. More information than MFCCs but larger. Used by modern neural approaches.

+ More information retained
+ Better for CNNs
+ Standard for transformers
- Larger feature vectors
- More sensitive to volume

Raw Waveform (Direct Audio Samples)
16,000 samples/sec | Compute: High

Let the neural network learn features from raw audio. Requires more data and compute but can discover optimal representations.

+ No information loss
+ Model learns optimal features
- Needs more training data
- Higher compute
- Harder to interpret
Practical Recommendation

For most embedded KWS applications, MFCCs with 13 coefficients remain the best choice. They are compact, cheap to compute, and well-supported by every framework. Use 40 log-mel filterbanks only if you have compute budget for larger CNN/Transformer models.

4

Small Footprint Model Architectures

The key insight: we are not trying to transcribe arbitrary speech. We only need to recognize 1-10 specific phrases. This dramatically simplifies the model architecture.

Depthwise Separable CNN: The Workhorse

Standard convolution computes all filter-channel combinations at once. Depthwise separable convolution factorizes this into two steps, dramatically reducing parameters and compute.

Standard Convolution

K filters, each of size (H x W x C_in)

Params: K * H * W * C_in
For 64 3x3 filters on 64 channels:
64 * 3 * 3 * 64 = 36,864 params
Depthwise Separable

Depthwise (H x W per channel) + Pointwise (1x1)

Params: (H * W * C_in) + (K * C_in)
Same 64 filters on 64 channels:
(3 * 3 * 64) + (64 * 64) = 4,672 params
~8x parameter reduction with minimal accuracy loss
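
A short PyTorch sketch (framework chosen for illustration) that reproduces the parameter counts above for 64 channels and 3x3 kernels.

import torch.nn as nn

channels = 64

# Standard 3x3 convolution: 64 * 3 * 3 * 64 = 36,864 parameters (no bias)
standard = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

# Depthwise (3x3 per channel) + pointwise (1x1 across channels): 576 + 4,096 = 4,672
depthwise_separable = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1,
              groups=channels, bias=False),                   # depthwise: 3*3*64 = 576
    nn.Conv2d(channels, channels, kernel_size=1, bias=False), # pointwise: 64*64 = 4,096
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(depthwise_separable))            # 36864 4672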

Architecture Comparison

Architecture | Parameters | MACs | Latency | Accuracy | Notes
DS-CNN (Depthwise Separable CNN) | ~20-100KB | ~5-20M | ~5ms | ~95% | Splits convolution into depthwise (spatial) and pointwise (channel) operations. Dramatic parameter reduction with minimal accuracy loss.
DSCNN-L (Large DS-CNN) | ~500KB | ~50M | ~15ms | ~97% | Scaled-up depthwise separable CNN with more layers and channels. Better accuracy at the cost of size.
TC-ResNet (Temporal Convolution ResNet) | ~300KB | ~30M | ~10ms | ~96% | 1D convolutions along the time axis with residual connections. Excellent for capturing temporal patterns in speech.
Attention RNN (LSTM with Attention) | ~200KB | ~40M | ~20ms | ~95% | Recurrent architecture with an attention mechanism. Good for variable-length keywords but harder to optimize.
MatchboxNet (NVIDIA MatchboxNet) | ~75KB | ~10M | ~8ms | ~97% | QuartzNet-style architecture scaled for embedded. Jasper/QuartzNet blocks with 1D convolutions.
Conformer-S (Small Streaming Conformer) | ~1MB | ~100M | ~30ms | ~98% | Hybrid attention-convolution architecture adapted for streaming. State-of-the-art accuracy but higher cost.

Microcontroller (Cortex-M4)

Extreme constraint: <100KB, <10ms

  • DS-CNN (small)
  • TFLite Micro
  • 13 MFCCs

Mobile/Edge (NPU)

Balanced: <500KB, <20ms

  • DS-CNN (large) or TC-ResNet
  • ONNX Runtime
  • 40 log-mel

Cloud/Server

Maximum accuracy: size flexible

  • Conformer or attention models
  • PyTorch/TensorFlow
  • 80 log-mel or raw waveform
5

Streaming Inference: Real-time Detection

Keywords do not arrive in neat 1-second chunks. They can start at any moment and span chunk boundaries. Streaming inference processes audio continuously with a sliding window, maintaining state between chunks.

The Streaming Pipeline

Audio Capture (microphone input; 16kHz, 16-bit PCM) -> Ring Buffer (sliding window; 1-2 seconds) -> Frame Extract (overlapping frames; 25ms frames, 10ms hop) -> Feature Compute (MFCC/mel extraction; 13-40 features) -> Neural Network (classification; DS-CNN, ~5ms) -> Smoothing (confidence filter; avoids spurious triggers) -> Wake Event (trigger callback; start ASR pipeline)

The Ring Buffer: Why It Matters

Imagine the user says "Hey Jarvis" right at the boundary between two audio chunks. If we only process each chunk independently, we would miss the keyword because half of it is in each chunk.

Solution: Sliding Window
1. Maintain a ring buffer of ~1-2 seconds of audio
2. On each new chunk, slide the window forward
3. Run inference on the entire window
4. The keyword is always fully contained in some window

--- Without ring buffer ---
Chunk 1: [.......Hey Jar]
Chunk 2: [vis.......]
--- With ring buffer ---
Window: [...Hey Jarvis...]
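
A minimal sketch of that sliding window, using a numpy array as the ring buffer; the window and chunk sizes are illustrative, and model_infer stands in for whatever KWS model you run.

import numpy as np

SR = 16000
WINDOW_SAMPLES = SR * 2          # 2-second sliding window
CHUNK_SAMPLES = 1280             # ~80ms chunks from the microphone

window = np.zeros(WINDOW_SAMPLES, dtype=np.int16)

def push_chunk(window, chunk):
    """Slide the window forward by one chunk and append the new audio."""
    window = np.roll(window, -len(chunk))
    window[-len(chunk):] = chunk
    return window

# On each new chunk: update the window, then run inference on the whole window,
# so a keyword spanning a chunk boundary is always fully contained in some window.
# window = push_chunk(window, chunk)
# confidence = model_infer(window)   # hypothetical KWS model call
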
Confidence Smoothing

A single high-confidence frame could be noise. Require N consecutive frames above threshold before triggering.

# Trigger only when the last N frame scores all exceed the threshold
if all(conf > threshold for conf in last_N_frames):
    trigger_wake()
Cooldown Period

After a detection, suppress triggers for 2-3 seconds to prevent the same keyword from triggering multiple times.

# Suppress re-triggering until the cooldown window has elapsed
if time_since_last_trigger > cooldown_ms:
    allow_trigger()
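
The two checks are usually combined into a single gate; here is a sketch under assumed defaults (3 consecutive frames, 0.5 threshold, 2-second cooldown).

import time
from collections import deque

class TriggerGate:
    """Require N consecutive frames above threshold, then enforce a cooldown."""

    def __init__(self, n_frames=3, threshold=0.5, cooldown_s=2.0):
        self.scores = deque(maxlen=n_frames)
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.last_trigger = -float("inf")

    def update(self, confidence: float) -> bool:
        self.scores.append(confidence)
        smoothed_ok = (len(self.scores) == self.scores.maxlen and
                       all(c > self.threshold for c in self.scores))
        cooled_down = (time.monotonic() - self.last_trigger) > self.cooldown_s
        if smoothed_ok and cooled_down:
            self.last_trigger = time.monotonic()
            return True
        return False

# gate = TriggerGate()
# if gate.update(confidence):   # call once per inference frame
#     start_asr()               # hypothetical wake callback
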
6

KWS Systems and Frameworks

From open-source community projects to commercial solutions. Choose based on your customization needs, deployment target, and budget.

System | Type | Keywords | Speed | Size | Notes
OpenWakeWord | Open Source | Custom trainable | ~5ms per inference | ~1.5MB per model | Python/ONNX, easy custom keyword training, community models available
Porcupine | Commercial | Custom trainable | ~2ms per inference | ~2MB per model | Picovoice product, free tier, many languages, on-device
Snowboy | Open Source | Custom trainable | ~5ms per inference | ~1MB per model | Deprecated but still used, Raspberry Pi compatible
Mycroft Precise | Open Source | Custom trainable | ~10ms per inference | ~500KB per model | TensorFlow Lite, Mycroft assistant, Python
TFLite Micro | Framework | Train your own | ~5-20ms | ~20-100KB | Google's microcontroller ML, runs on Cortex-M4+
Google Speech Commands | Pre-trained | 35 fixed commands | ~10ms | ~500KB | Yes/No/Up/Down/etc., benchmark standard

For Hobbyist Projects

Start with OpenWakeWord. It's free, makes it easy to train custom keywords, and ships pre-built models for common wake words.

Python, ONNX, Raspberry Pi compatible
For Production Apps

Porcupine offers the best balance of accuracy, latency, and cross-platform support. Free tier available.

Mobile, embedded, desktop, 30+ languages
For Maximum Control

Train your own with TFLite Micro. Full control over architecture, training data, and deployment.

Microcontrollers, TensorFlow ecosystem
For Research/Benchmarking

Google Speech Commands dataset with 35 keywords is the standard benchmark for comparing KWS architectures.

65,000 one-second clips, 35 classes
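
If you want to run against Speech Commands yourself, torchaudio ships a loader for it (tooling choice is ours; exact signatures may differ across torchaudio versions).

import torchaudio

# Downloads the Google Speech Commands corpus into ./data on first use
dataset = torchaudio.datasets.SPEECHCOMMANDS("./data", download=True, subset="training")

waveform, sample_rate, label, speaker_id, utterance_number = dataset[0]
print(waveform.shape, sample_rate, label)   # e.g. torch.Size([1, 16000]) 16000 'backward'
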
7

Code Examples

Get started with keyword spotting in Python. From pre-trained models to training your own.

OpenWakeWord (Open Source) | pip install openwakeword
Recommended
import openwakeword
from openwakeword.model import Model
import pyaudio
import numpy as np

# One-time download of the bundled pre-trained models
# (required by recent openwakeword releases)
openwakeword.utils.download_models()

# Load OpenWakeWord model
model = Model(
    wakeword_models=["hey_jarvis"],  # Use built-in or custom model
    inference_framework="onnx"
)

# Audio stream settings
CHUNK = 1280  # ~80ms at 16kHz (model expects this)
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000

p = pyaudio.PyAudio()
stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    frames_per_buffer=CHUNK
)

print("Listening for wake word...")

try:
    while True:
        # Read audio chunk
        audio_bytes = stream.read(CHUNK)
        audio_array = np.frombuffer(audio_bytes, dtype=np.int16)

        # Run wake word detection
        prediction = model.predict(audio_array)

        # Check whether any wake word was detected; predict() returns the
        # latest score per model (score history is in model.prediction_buffer)
        for wake_word, score in prediction.items():
            if score > 0.5:
                print(f"Wake word detected: {wake_word} ({score:.2%})")
                # Trigger your ASR pipeline here

except KeyboardInterrupt:
    pass
finally:
    stream.stop_stream()
    stream.close()
    p.terminate()

# Training custom wake word:
# 1. Collect 3-5 positive samples of your wake word
# 2. Use OpenWakeWord's training script
# 3. Fine-tune on your recordings
# 4. Export to ONNX for deployment

Quick Reference

For Getting Started
  • OpenWakeWord for custom keywords
  • Google Speech Commands dataset
  • 13 MFCCs, DS-CNN architecture
For Production
  • Porcupine for cross-platform
  • TFLite Micro for MCUs
  • Streaming with ring buffer
Key Numbers
  • Power: <5mW target
  • Latency: <200ms
  • Model: <100KB
  • Accuracy: >95% recall

Use Cases

  • Voice wake word
  • On-device commands
  • Industrial alarms
  • Assistive devices

Architectural Patterns

Tiny CNN on MFCCs

Lightweight conv models on spectrograms.

Streaming Transformers

Low-latency attention for continuous audio.

Implementations

API Services

Picovoice Porcupine (Picovoice)
Commercial-grade embedded KWS.

Open Source

Google Speech Commands KWS (Apache 2.0)
Reference CNN for KWS.

openWakeWord (MPL-2.0)
On-device wake word models.


Quick Facts

Input: Audio
Output: Structured Data
Implementations: 2 open source, 1 API
Patterns: 2 approaches
