Speech Recognition
Transcribe spoken audio into text. The foundation for voice interfaces, meeting transcription, and audio search.
How Speech Recognition Works
A technical deep-dive into automatic speech recognition. From Whisper to real-time transcription with speaker diarization.
ASR Tasks
Speech recognition includes multiple related tasks beyond basic transcription.
- Transcription: audio to text
- Diarization: who spoke when
- Word Timestamps: precise timing
- Translation: speech to text in another language
The ASR Pipeline
Raw audio is converted into a log-mel spectrogram, which the model then decodes into text.
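A minimal sketch of these stages using the openai-whisper package's lower-level helpers (the model size and audio.mp3 filename are placeholders); the single-call transcribe() API shown later wraps the same steps.
import whisper

model = whisper.load_model('base')

# Load audio and pad/trim it to the model's 30-second window
audio = whisper.load_audio('audio.mp3')
audio = whisper.pad_or_trim(audio)

# Convert the waveform into a log-mel spectrogram (the model's actual input)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the spectrogram into text
options = whisper.DecodingOptions(language='en', task='transcribe')
result = whisper.decode(model, mel, options)
print(result.text)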
Model Evolution
From RNN-based models to transformer foundation models.
Whisper Deep-Dive
OpenAI's Whisper is the most widely used ASR model. Trained on 680,000 hours of multilingual data.
Whisper Model Sizes
| Size | Parameters | Relative Speed | VRAM | WER (English) |
|---|---|---|---|---|
| tiny | 39M | 32x | ~1GB | ~7.6% |
| base | 74M | 16x | ~1GB | ~5.0% |
| small | 244M | 6x | ~2GB | ~3.4% |
| medium | 769M | 2x | ~5GB | ~2.5% |
| large-v3 | 1.5B | 1x | ~10GB | ~1.5% |
| turbo | 809M | 8x | ~6GB | ~1.5% |
Key Features
- +99 languages supported
- +Multitask: transcribe, translate, timestamps
- +Robust to noise, accents, background audio
- +Word-level timestamps with alignment
Limitations
- -No speaker diarization (need separate model)
- -30-second processing chunks
- -Can hallucinate on silence/noise
- -Autoregressive = sequential decoding
Whisper Special Tokens
<|startoftranscript|><|en|><|transcribe|><|0.00|>
Every decoded sequence starts with control tokens: start of transcript, the language (<|en|>), the task (<|transcribe|> or <|translate|>), and a timestamp token marking 0.00 seconds.
ASR Metrics
How to measure transcription quality.
Word Error Rate (WER)
The standard metric for ASR quality. Lower is better.
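WER counts the word-level substitutions (S), deletions (D), and insertions (I) needed to turn the model output into the reference transcript, divided by the number of reference words (N): WER = (S + D + I) / N. A minimal sketch, assuming the jiwer package and made-up example strings:
pip install jiwer
from jiwer import wer

reference = 'the quick brown fox jumps over the lazy dog'
hypothesis = 'the quick brown fox jumped over a lazy dog'

# (S + D + I) / N over whitespace-split words: here 2 substitutions / 9 words
print(f'WER: {wer(reference, hypothesis):.2%}')  # ~22%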
Speed Optimization
Techniques for faster transcription.
- CTranslate2 backend: 4x faster with int8 (used by faster-whisper).
- Memory-efficient attention: 2x speedup.
- Batched inference: process multiple chunks in parallel.
- Skip silence: Silero VAD integration (combined with int8 quantization in the sketch below).
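A short sketch combining two of the techniques above with faster-whisper: int8 quantization on the CTranslate2 backend plus Silero VAD filtering (model size and file name are placeholders):
pip install faster-whisper
from faster_whisper import WhisperModel

# int8 quantization via the CTranslate2 backend (runs on CPU or GPU)
model = WhisperModel('small', device='cpu', compute_type='int8')

# vad_filter=True uses the bundled Silero VAD to skip silent stretches
segments, info = model.transcribe(
    'recording.mp3',
    beam_size=1,       # greedy decoding: faster, slightly less accurate
    vad_filter=True
)
for segment in segments:
    print(f'[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}')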
Code Examples
Get started with speech recognition in Python.
import whisper
# Load model (tiny, base, small, medium, large-v3, turbo)
model = whisper.load_model('turbo')
# Transcribe audio file
result = model.transcribe(
    'audio.mp3',
    language='en',          # Optional: auto-detect if not set
    task='transcribe',      # or 'translate' for speech-to-English
    word_timestamps=True,   # Get word-level timing
    fp16=True               # Use FP16 for speed
)
# Results
print(result['text'])  # Full transcript
# Word-level timestamps
for segment in result['segments']:
    for word in segment.get('words', []):
        print(f"{word['start']:.2f}s: {word['word']}")
Quick Reference
- Best quality: Whisper large-v3, Canary-1B (NVIDIA)
- Fast inference: Whisper turbo, Distil-Whisper, faster-whisper
- Speaker diarization: pyannote 3.1, NeMo MSDD
Use Cases
- ✓Meeting transcription
- ✓Voice assistants
- ✓Podcast search
- ✓Call center analytics
Architectural Patterns
End-to-End ASR
Single model that directly maps audio to text (Whisper-style).
- +Simple pipeline
- +Handles accents well
- +Multilingual
- -Can be slow for long audio
- -Needs chunking strategy for long audio (sketched below)
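A naive version of that chunking strategy, assuming a local long_recording.mp3 file: slice the waveform into fixed 30-second windows and transcribe each one. Note that whisper's transcribe() already slides a 30-second window internally for long files, and fixed windows can cut words in half, so real systems usually overlap windows or split on silence instead.
import whisper

CHUNK_SECONDS = 30
SAMPLE_RATE = 16000  # whisper.load_audio always resamples to 16 kHz mono

model = whisper.load_model('base')
audio = whisper.load_audio('long_recording.mp3')

texts = []
for start in range(0, len(audio), CHUNK_SECONDS * SAMPLE_RATE):
    chunk = audio[start:start + CHUNK_SECONDS * SAMPLE_RATE]
    result = model.transcribe(chunk)
    texts.append(result['text'].strip())

print(' '.join(texts))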
Streaming ASR
Real-time transcription with low latency; see the sketch after this list.
- +Live transcription
- +Sub-second latency
- -Slightly lower accuracy
- -More complex deployment
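Whisper-family models are not true streaming models, but low latency can be approximated by repeatedly re-transcribing a short trailing audio buffer. A toy sketch assuming the sounddevice and faster-whisper packages and a working microphone (window lengths and model size are illustrative):
pip install sounddevice faster-whisper
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000   # Whisper-family models expect 16 kHz mono audio
WINDOW_SECONDS = 5    # re-transcribe only the trailing 5 seconds

model = WhisperModel('base', device='cpu', compute_type='int8')
buffer = np.zeros(0, dtype=np.float32)

def on_audio(indata, frames, time, status):
    # Called from the audio thread; append the new mono samples
    global buffer
    buffer = np.concatenate([buffer, indata[:, 0]])

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=on_audio):
    while True:
        sd.sleep(2000)  # wake up every 2 seconds
        window = buffer[-WINDOW_SECONDS * SAMPLE_RATE:]
        if len(window) == 0:
            continue
        segments, _ = model.transcribe(window, beam_size=1)
        print(' '.join(s.text.strip() for s in segments))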
ASR + Diarization Pipeline
Separate speaker identification from transcription.
- +Know who said what
- +Better for meetings
- -Multi-step pipeline
- -Alignment challenges
Implementations
API Services
OpenAI Whisper API
OpenAI's whisper-1 model. Fast, accurate, handles many languages.
Deepgram
Fast streaming ASR. Nova-2 model. Good for real-time.
AssemblyAI
Best-in-class for English. Includes diarization, summarization.
Open Source
Whisper (local)
MIT license. Run locally. Large-v3 is best quality, turbo for speed.
Code Examples
Transcribe with OpenAI Whisper API
Fast cloud transcription with OpenAI
pip install openai
from openai import OpenAI
client = OpenAI()
# Transcribe audio file
with open('recording.mp3', 'rb') as audio_file:
    transcript = client.audio.transcriptions.create(
        model='whisper-1',
        file=audio_file,
        response_format='text'
    )
print(transcript)
Local Transcription with faster-whisper
4x faster than OpenAI Whisper, runs locally
pip install faster-whisper
from faster_whisper import WhisperModel
# Load model (use 'large-v3' for best quality, 'base' for speed)
model = WhisperModel('large-v3', device='cuda', compute_type='float16')
# Transcribe
segments, info = model.transcribe('recording.mp3', beam_size=5)
print(f'Detected language: {info.language} ({info.language_probability:.2f})')
print('\nTranscript:')
for segment in segments:
    print(f'[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}')
Transcribe with Speaker Diarization
Know who said what using pyannote
pip install faster-whisper pyannote.audio
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline
import torch
# Load models
whisper = WhisperModel('large-v3', device='cuda')
diarization = Pipeline.from_pretrained(
    'pyannote/speaker-diarization-3.1',
    use_auth_token='YOUR_HF_TOKEN'
)
diarization.to(torch.device('cuda'))  # move the diarization pipeline to GPU
# Diarize (who speaks when)
audio_file = 'meeting.wav'
diarization_result = diarization(audio_file)
# Transcribe
segments, _ = whisper.transcribe(audio_file)
segments = list(segments)
# Combine: assign speakers to transcript segments
for segment in segments:
    # Find speaker at segment midpoint
    t = (segment.start + segment.end) / 2
    speaker = 'UNKNOWN'
    for turn, _, spk in diarization_result.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            speaker = spk
            break
    print(f'[{speaker}] {segment.text}')
Quick Facts
- Input: Audio
- Output: Text
- Implementations: 3 open source, 3 API
- Patterns: 3 approaches