OpenAI Whisper: Speculative Decoding for 2x Faster Inference
AI Impact Summary
OpenAI's Whisper models, particularly large-v3, can be significantly accelerated with speculative decoding. This technique, introduced in the Google paper "Fast Inference from Transformers via Speculative Decoding", achieves roughly a 2x speedup in inference by using a smaller, faster 'assistant' model to draft candidate tokens, which the slower 'main' model then verifies in parallel rather than generating one token at a time. The approach is most effective when the assistant correctly predicts the majority of tokens, since the main model can then accept long runs of drafted tokens per forward pass; the assistant must also share Whisper's tokenizer and vocabulary for its drafts to be verifiable.
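The draft-then-verify loop described above can be sketched as follows. This is a minimal greedy illustration with toy stand-in callables rather than real Whisper networks (the function names, `k` parameter, and toy models are all illustrative, not from the source); in practice, Hugging Face Transformers exposes the same mechanism by passing an `assistant_model` to `generate()`.

```python
def speculative_decode(main_model, assistant_model, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding sketch.

    `main_model` and `assistant_model` are callables mapping a token
    context (list of ints) to the next token. The assistant drafts k
    tokens per step; the main model verifies them and keeps the longest
    matching prefix plus its own correction at the first mismatch.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Assistant drafts k candidate tokens autoregressively (cheap).
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = assistant_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Main model checks each drafted position. In a real
        #    implementation all k positions are scored in ONE forward
        #    pass, which is where the speedup comes from.
        accepted = 0
        for i in range(k):
            expected = main_model(tokens + draft[:i])
            if expected == draft[i]:
                accepted += 1
            else:
                # First mismatch: keep the verified prefix and substitute
                # the main model's own token, guaranteeing output identical
                # to main-model-only greedy decoding.
                tokens.extend(draft[:accepted] + [expected])
                break
        else:
            tokens.extend(draft)  # every drafted token was accepted
    return tokens[:len(prompt) + max_new_tokens]


# Toy demo: the assistant agrees with the main model except after token 3,
# so most drafts are accepted and the output still matches greedy decoding.
main = lambda ctx: (ctx[-1] + 1) % 100
assistant = lambda ctx: 99 if ctx[-1] == 3 else (ctx[-1] + 1) % 100
out = speculative_decode(main, assistant, [0], max_new_tokens=8)
```

The key property, preserved even in this sketch, is that the output is identical to what the main model alone would generate; the assistant only changes how fast that output is produced.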
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info