Speculative Decoding for Whisper Inference — 2x Faster with Verified Outputs
AI Impact Summary
Speculative decoding pairs a fast assistant Whisper model, which drafts candidate tokens, with the slower main Whisper model, which verifies them, roughly halving transcription latency while producing outputs identical to running the main model alone. It can be a drop-in improvement for Whisper pipelines provided the assistant shares the main model's vocabulary and is significantly faster; the speedup depends on the token mix (roughly 70-80% of tokens are "easy" enough for the assistant to predict correctly) and the cost of the main model's verification step. Key details include using openai/whisper-large-v2 as the baseline, noting that large-v3's expanded vocabulary may break compatibility with v2-based assistants, and leveraging HuggingFace Transformers' assisted generation together with SDPA or Flash Attention. Practical adoption requires careful benchmarking across languages and audio conditions to confirm identical results while realizing the latency gains.
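A minimal sketch of how this wiring might look with Transformers' assisted generation. The choice of distil-whisper/distil-large-v2 as the assistant and the sample.wav filename are assumptions for illustration; any faster model sharing the main model's tokenizer should work:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    pipeline,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Main model: verifies the assistant's draft tokens, so the final
# transcript matches running it alone under greedy decoding.
model_id = "openai/whisper-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",  # scaled-dot-product attention kernel
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Assistant (assumed): a distilled Whisper decoder that shares the
# main model's vocabulary and drafts candidate tokens cheaply.
assistant_model_id = "distil-whisper/distil-large-v2"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
).to(device)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    # Passing the assistant here enables assisted generation:
    # the main model only verifies the drafted tokens.
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("sample.wav")  # hypothetical audio file
print(result["text"])
```

Benchmarking this pipeline against the same setup without `assistant_model`, on the same audio and language mix, is the straightforward way to verify both the latency gain and output parity.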
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info