Hugging Face Transformers adds KV Cache Quantization to extend long-context generation for Llama-2 models
AI Impact Summary
Hugging Face's KV Cache Quantization reduces the memory consumed by the key-value cache during autoregressive generation, enabling longer contexts on GPUs with limited memory. For a 10,000-token context on a 7B Llama-2 model, the KV cache alone can require about 5 GB; this feature shrinks it by quantizing older entries while keeping the most recent keys/values in full precision via a residual cache (default residual length: 128 tokens). The implementation supports per-token quantization of keys and values with backends such as quanto (int2/int4) and HQQ (int2/int4/int8), but adds a per-step quantization/dequantization cost that can slow generation. The net effect is longer generations without hitting memory limits; teams should tune the quantization parameters and residual length to balance memory, speed, and quality when deploying Llama-2 models.
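As a rough check on the ~5 GB figure, here is a back-of-the-envelope sketch, assuming fp16 storage and Llama-2-7B's published shape (32 layers, 32 KV heads, head dimension 128):

```python
# Back-of-the-envelope KV cache size for Llama-2-7B at fp16.
# Shape assumptions (from the 7B config): 32 layers, 32 KV heads, head_dim 128.
n_layers, n_kv_heads, head_dim = 32, 32, 128
seq_len = 10_000
bytes_per_elem = 2  # fp16
kv_factor = 2       # one tensor each for keys and values

cache_bytes = kv_factor * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{cache_bytes / 1e9:.2f} GB")  # ~5.24 GB, matching the ~5 GB estimate
```

Quantizing those entries to int4 stores them in roughly a quarter of the space (plus scale/zero-point overhead), which is where the memory savings come from.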
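And a minimal usage sketch of the quantized-cache path in `generate`, assuming a Transformers version that accepts `cache_implementation="quantized"` with a `cache_config` dict, and using the quanto backend with 4-bit keys/values and the default 128-token residual length:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Long-context prompt goes here...", return_tensors="pt").to(model.device)

# Older KV entries are quantized to 4 bits via the quanto backend;
# the most recent `residual_length` tokens stay in full precision.
out = model.generate(
    **inputs,
    max_new_tokens=256,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4, "residual_length": 128},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Raising `nbits` or `residual_length` trades memory back for quality and speed, which is the tuning knob the summary above recommends adjusting per deployment.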
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info