Hugging Face Transformers adds KV Cache Quantization to extend context length for LLMs
AI Impact Summary
KV Cache Quantization in Hugging Face Transformers reduces the memory footprint of the self-attention key-value cache, enabling longer-context generation with minimal quality loss. Following a KIVI-inspired methodology, keys are quantized per-channel and values per-token, while a small residual cache of the most recent tokens is kept in the original precision. Quantization is delegated to backends such as Quanto (int2/int4) and HQQ (int2/int4/int8), and the memory savings trade off against possible speed and accuracy costs depending on the chosen settings. For scale, the fp16 KV cache of a 7B Llama-2 model at 10k tokens occupies roughly 5 GB, so quantizing it enables longer generations on consumer GPUs, at the cost of tuning quantization parameters and the residual cache length to balance performance and quality. A minimal usage sketch is shown below.
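The sketch below assumes the `cache_implementation="quantized"` and `cache_config` arguments to `generate` described in the Transformers documentation for this feature; the model id, prompt, and generation settings are illustrative, and the comment reproduces the rough arithmetic behind the ~5 GB estimate above.

```python
# Minimal sketch of quantized KV-cache generation with Hugging Face Transformers.
# Assumes a recent transformers version with the quantized cache feature and the
# quanto backend installed; model id and settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough fp16 KV-cache size for a 7B Llama-2-style model at 10k tokens:
# 2 (K and V) * 32 layers * 32 heads * 128 head_dim * 10_000 tokens * 2 bytes
# ~= 5.2 GB, matching the ~5 GB figure cited in the summary.

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Explain KV cache quantization.", return_tensors="pt").to(model.device)

# cache_implementation="quantized" selects the quantized KV cache; cache_config
# picks the backend (quanto: int2/int4, hqq: int2/int4/int8) and the bit width.
# The residual cache length that keeps recent tokens unquantized is also
# configurable (assumed key name: "residual_length").
out = model.generate(
    **inputs,
    max_new_tokens=256,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Lower bit widths (e.g. int2) save more memory but are more likely to degrade output quality, so the backend and bit width are the main knobs to tune against the available GPU memory.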
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info