Faster assisted generation on Intel Gaudi with speculative sampling via Optimum Habana
AI Impact Summary
Intel Gaudi gains accelerated text generation through assisted decoding and speculative sampling, now integrated into Optimum Habana and the extended Hugging Face libraries. The approach uses a small draft model to propose K candidate tokens, which the target model then verifies; each model keeps its own KV cache. This delivers roughly 2x speedups for large transformers while preserving the target model's sampling distribution. The .generate() API gains an optional assistant_model parameter, so existing pipelines can adopt the faster path without major refactoring. In production, this can lower latency and infrastructure costs while potentially increasing parallel throughput on Gaudi-based deployments.
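The draft-then-verify loop can be illustrated with a toy implementation. The sketch below is a minimal, self-contained illustration of the speculative sampling acceptance rule (accept a draft token x with probability min(1, p_target(x) / q_draft(x)); on rejection, resample from the normalized residual max(0, p - q); if all K drafts are accepted, sample one bonus token from the target). The context-free "models" here are placeholders for illustration only and are not part of Optimum Habana or the Hugging Face API:

```python
import random

def sample(probs):
    # Inverse-CDF sampling from a discrete distribution.
    r, c = random.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if r < c:
            return i
    return len(probs) - 1

def speculative_step(p_target, q_draft, k):
    """One round of speculative sampling with toy models.

    p_target and q_draft map a prefix (tuple of token ids) to a
    probability distribution over the vocabulary. The draft proposes
    k tokens autoregressively; the target then verifies each one.
    Returns the list of accepted tokens (at most k + 1)."""
    # 1. Draft model proposes k tokens.
    proposed, ctx = [], ()
    for _ in range(k):
        q = q_draft(ctx)
        t = sample(q)
        proposed.append((t, q))
        ctx = ctx + (t,)
    # 2. Target model verifies each proposal in order.
    out, ctx = [], ()
    for t, q in proposed:
        p = p_target(ctx)
        if random.random() < min(1.0, p[t] / q[t]):
            out.append(t)            # accept the draft token
            ctx = ctx + (t,)
        else:
            # Reject: resample from the normalized residual max(0, p - q).
            resid = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            z = sum(resid)
            out.append(sample([ri / z for ri in resid]))
            return out               # stop at the first rejection
    # 3. All k drafts accepted: sample one bonus token from the target.
    out.append(sample(p_target(ctx)))
    return out
```

Note that when the draft and target distributions coincide, every proposal is accepted and each round yields k + 1 tokens, which is where the speedup comes from. In the actual Hugging Face integration, none of this is written by hand: passing assistant_model to .generate() activates the assisted path.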
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info