Speculative decoding doubles Whisper inference speed: assisted generation with openai/whisper-large-v2
AI Impact Summary
Speculative decoding enables Whisper to run roughly 2x faster by using a fast assistant model to draft candidate tokens that the slower main model then verifies, producing output identical to running the main model alone. Implementing this requires an assistant model that shares the main model's vocabulary and is at least 3x faster, plus support in the inference framework (e.g., Hugging Face Transformers' assisted-generation strategy, optionally combined with SDPA or Flash Attention). Adopting it means migrating pipelines to the assisted-generation flow and validating output parity across languages, but the throughput gains can significantly reduce per-hour transcription costs for Whisper workloads.
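The draft-then-verify loop described above can be sketched in pure Python with toy stand-in "models". This is an illustrative sketch of the greedy-decoding case, not the Transformers implementation: `main_model`, `assistant_model`, and `speculative_decode` are hypothetical names, and a real implementation verifies all draft tokens in a single batched forward pass rather than one at a time.

```python
# Toy sketch of speculative decoding (greedy case). The two "models"
# are deterministic stand-ins mapping a token prefix to a next token;
# in practice they would be the main Whisper model and a faster
# assistant. All names here are illustrative.

def main_model(prefix):
    # Slow, authoritative predictor: next token = sum(prefix) mod 7.
    return sum(prefix) % 7

def assistant_model(prefix):
    # Fast but imperfect: agrees with the main model most of the time.
    guess = sum(prefix) % 7
    return guess if len(prefix) % 4 else (guess + 1) % 7

def greedy_decode(prompt, num_tokens):
    # Baseline: the main model alone, one token per step.
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(main_model(tokens))
    return tokens[len(prompt):]

def speculative_decode(prompt, num_tokens, draft_len=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1) Assistant cheaply drafts `draft_len` candidate tokens.
        draft = []
        for _ in range(draft_len):
            draft.append(assistant_model(tokens + draft))
        # 2) Main model verifies each candidate in order; matching
        #    candidates are accepted, and the first mismatch is
        #    replaced by the main model's own prediction. (A real
        #    implementation checks the whole draft in one forward pass.)
        for cand in draft:
            target = main_model(tokens)
            tokens.append(target)
            if target != cand:
                break  # discard the rest of the draft
            if len(tokens) - len(prompt) >= num_tokens:
                break
    return tokens[len(prompt):len(prompt) + num_tokens]

# Every accepted token equals the main model's greedy choice, so the
# output matches plain greedy decoding exactly.
print(speculative_decode([1, 2], 10) == greedy_decode([1, 2], 10))
```

Because verification only ever keeps tokens the main model would have produced itself, the speedup comes purely from accepting several drafted tokens per main-model verification step; the transcription is unchanged. In Transformers, the same idea is enabled by passing `assistant_model=...` to `generate()`.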
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info