Accelerate StarCoder with Optimum Intel on Xeon: Q8/Q4 quantization and speculative decoding
AI Impact Summary
Intel's optimization of StarCoder-15B combines 8-bit (INT8) quantization via SmoothQuant, 4-bit (INT4) group-wise weight-only quantization (WOQ), and Assisted Generation (speculative decoding) to reduce autoregressive inference latency. The workflow relies on PyTorch 2.0 and Intel Extension for PyTorch (IPEX) on 4th Gen Intel Xeon processors with AMX accelerators, delivering roughly 2.2x improvements in time to first token (TTFT) and time per output token (TPOT) at INT8, and roughly 3.35x TPOT at INT4, while preserving accuracy on MBPP. Production teams should implement a quantization and calibration pipeline and validate against target workloads (e.g., HumanEval or MBPP) to realize these gains; memory bandwidth remains the key bottleneck for token generation. The result is a meaningful throughput and latency uplift for code-generation services on Intel hardware, contingent on robust calibration and monitoring to maintain model quality.
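The speculative-decoding step above can be sketched as a draft/verify loop. The code below is a toy illustration under stated assumptions, not the IPEX or Transformers API: the one-token functions stand in for a small draft model and the large target model (StarCoder-15B in the article), and the accept rule shown is the greedy variant (accept draft tokens while the target agrees, then take the target's correction).

```python
# Toy sketch of the draft/verify loop behind speculative decoding
# (Assisted Generation). The "models" here are hypothetical stand-ins:
# in practice the draft is a small LM and the target is StarCoder-15B.

def draft_model(prefix):
    """Toy draft model: proposes the next token as (last token + 1) % 10."""
    return (prefix[-1] + 1) % 10

def target_model(prefix):
    """Toy target model: usually agrees with the draft, but emits 0
    whenever the last token is 7 (a deliberate disagreement point)."""
    return 0 if prefix[-1] == 7 else (prefix[-1] + 1) % 10

def speculative_decode(prompt, num_tokens, k=4):
    """Generate num_tokens tokens: the draft model proposes k tokens per
    step, and the target model verifies them, accepting the longest
    agreeing prefix plus its own token at the first mismatch."""
    seq = list(prompt)
    while len(seq) - len(prompt) < num_tokens:
        # 1) Draft k candidate tokens autoregressively (cheap model).
        candidates, ctx = [], list(seq)
        for _ in range(k):
            t = draft_model(ctx)
            candidates.append(t)
            ctx.append(t)
        # 2) Verify: with a real model this is a single batched forward
        #    pass over all candidates; shown sequentially for clarity.
        accepted, ctx = [], list(seq)
        for t in candidates:
            expected = target_model(ctx)
            if expected == t:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(expected)  # target's corrected token
                break
        seq.extend(accepted)
    return seq[len(prompt):][:num_tokens]
```

Because every emitted token is checked against the target model, greedy speculative decoding reproduces the target's own greedy output exactly; the speedup comes from verifying k drafted tokens in one target forward pass instead of k passes. In the Hugging Face Transformers stack this is exposed via `model.generate(..., assistant_model=draft)`.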
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info