Speculative Decoding for Whisper Inference — 2x Faster with Verified Outputs
AI Impact Summary
Speculative decoding pairs a fast assistant Whisper model, which drafts candidate tokens, with the slower main Whisper model, which verifies them, roughly halving transcription latency while producing outputs identical to running the main model alone. It can be a drop-in improvement for Whisper pipelines provided the assistant shares the main model's vocabulary and is significantly faster; the speedup depends on the token mix (roughly 70-80% of tokens are "easy" enough for the assistant to predict correctly) and the cost of the main model's verification step. Key details include using openai/whisper-large-v2 as the baseline, noting that large-v3's expanded vocabulary may break compatibility with v2-based assistants, and leveraging HuggingFace Transformers' assisted generation together with SDPA or Flash Attention. Practical adoption requires careful benchmarking across languages and audio conditions to confirm identical results while realizing the latency gains.
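A minimal sketch of how this wiring might look with Transformers' assisted generation. The choice of distil-whisper/distil-large-v2 as the assistant and the sample.wav filename are assumptions for illustration; any faster model sharing the main model's tokenizer should work:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    pipeline,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Main model: verifies the assistant's draft tokens, so the final
# transcript matches running it alone under greedy decoding.
model_id = "openai/whisper-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",  # scaled-dot-product attention kernel
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Assistant (assumed): a distilled Whisper decoder that shares the
# main model's vocabulary and drafts candidate tokens cheaply.
assistant_model_id = "distil-whisper/distil-large-v2"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
).to(device)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    # Passing the assistant here enables assisted generation:
    # the main model only verifies the drafted tokens.
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("sample.wav")  # hypothetical audio file
print(result["text"])
```

Benchmarking this pipeline against the same setup without `assistant_model`, on the same audio and language mix, is the straightforward way to verify both the latency gain and output parity.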
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info