Accelerate StarCoder with Optimum Intel on Xeon: Q8/Q4 quantization and speculative decoding
AI Impact Summary
Intel's optimization of StarCoder-15B combines 8-bit (INT8) quantization via SmoothQuant, 4-bit (INT4) group-wise weight-only quantization (WOQ), and Assisted Generation (speculative decoding) to reduce autoregressive inference latency. The workflow relies on PyTorch 2.0 and Intel Extension for PyTorch (IPEX) on 4th Gen Intel Xeon processors with AMX accelerators, delivering roughly 2.2x improvements in time to first token (TTFT) and time per output token (TPOT) at INT8, and roughly 3.35x TPOT at INT4, while preserving accuracy on MBPP. Production teams should implement a quantization and calibration pipeline and validate against target workloads (e.g., HumanEval or MBPP) to realize these gains; memory bandwidth remains the key bottleneck for token generation. The result is a meaningful throughput and latency uplift for code-generation services on Intel hardware, contingent on robust calibration and monitoring to maintain model quality.
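The speculative-decoding step above can be sketched as a draft/verify loop. The code below is a toy illustration under stated assumptions, not the IPEX or Transformers API: the one-token functions stand in for a small draft model and the large target model (StarCoder-15B in the article), and the accept rule shown is the greedy variant (accept draft tokens while the target agrees, then take the target's correction).

```python
# Toy sketch of the draft/verify loop behind speculative decoding
# (Assisted Generation). The "models" here are hypothetical stand-ins:
# in practice the draft is a small LM and the target is StarCoder-15B.

def draft_model(prefix):
    """Toy draft model: proposes the next token as (last token + 1) % 10."""
    return (prefix[-1] + 1) % 10

def target_model(prefix):
    """Toy target model: usually agrees with the draft, but emits 0
    whenever the last token is 7 (a deliberate disagreement point)."""
    return 0 if prefix[-1] == 7 else (prefix[-1] + 1) % 10

def speculative_decode(prompt, num_tokens, k=4):
    """Generate num_tokens tokens: the draft model proposes k tokens per
    step, and the target model verifies them, accepting the longest
    agreeing prefix plus its own token at the first mismatch."""
    seq = list(prompt)
    while len(seq) - len(prompt) < num_tokens:
        # 1) Draft k candidate tokens autoregressively (cheap model).
        candidates, ctx = [], list(seq)
        for _ in range(k):
            t = draft_model(ctx)
            candidates.append(t)
            ctx.append(t)
        # 2) Verify: with a real model this is a single batched forward
        #    pass over all candidates; shown sequentially for clarity.
        accepted, ctx = [], list(seq)
        for t in candidates:
            expected = target_model(ctx)
            if expected == t:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(expected)  # target's corrected token
                break
        seq.extend(accepted)
    return seq[len(prompt):][:num_tokens]
```

Because every emitted token is checked against the target model, greedy speculative decoding reproduces the target's own greedy output exactly; the speedup comes from verifying k drafted tokens in one target forward pass instead of k passes. In the Hugging Face Transformers stack this is exposed via `model.generate(..., assistant_model=draft)`.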
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info