Hugging Face Transformers adds KV Cache Quantization to extend long-context generation for Llama-2 models
AI Impact Summary
Hugging Face's KV Cache Quantization reduces the memory consumed by the key-value cache during autoregressive generation, enabling longer contexts on GPUs with limited memory. For a 10,000-token context on a 7B Llama-2 model, the KV cache alone can require about 5 GB; this feature shrinks it by quantizing older entries while keeping the most recent keys/values in full precision via a residual cache (default residual length: 128 tokens). The implementation supports per-token quantization of keys and values with backends such as quanto (int2/int4) and HQQ (int2/int4/int8), but adds a per-step quantization/dequantization cost that can slow generation. The net effect is longer generations without hitting memory limits; teams should tune the quantization parameters and residual length to balance memory, speed, and quality when deploying Llama-2 models.
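As a rough check on the ~5 GB figure, here is a back-of-the-envelope sketch, assuming fp16 storage and Llama-2-7B's published shape (32 layers, 32 KV heads, head dimension 128):

```python
# Back-of-the-envelope KV cache size for Llama-2-7B at fp16.
# Shape assumptions (from the 7B config): 32 layers, 32 KV heads, head_dim 128.
n_layers, n_kv_heads, head_dim = 32, 32, 128
seq_len = 10_000
bytes_per_elem = 2  # fp16
kv_factor = 2       # one tensor each for keys and values

cache_bytes = kv_factor * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{cache_bytes / 1e9:.2f} GB")  # ~5.24 GB, matching the ~5 GB estimate
```

Quantizing those entries to int4 stores them in roughly a quarter of the space (plus scale/zero-point overhead), which is where the memory savings come from.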
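And a minimal usage sketch of the quantized-cache path in `generate`, assuming a Transformers version that accepts `cache_implementation="quantized"` with a `cache_config` dict, and using the quanto backend with 4-bit keys/values and the default 128-token residual length:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Long-context prompt goes here...", return_tensors="pt").to(model.device)

# Older KV entries are quantized to 4 bits via the quanto backend;
# the most recent `residual_length` tokens stay in full precision.
out = model.generate(
    **inputs,
    max_new_tokens=256,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4, "residual_length": 128},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Raising `nbits` or `residual_length` trades memory back for quality and speed, which is the tuning knob the summary above recommends adjusting per deployment.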
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info