KV Caching in nanoVLM yields 38% generation speedup
AI Impact Summary
nanoVLM introduces a per-layer KV cache in its self-attention blocks to reuse previously computed keys and values across generation steps. After an initial prefill pass over the prompt, each step computes only the new token's K/V and appends them to the cache, turning generation into an incremental update and yielding a 38% generation speedup. Each layer's cache is a dictionary holding 'key' and 'value' tensors, and the implementation must treat the prefill and decode phases separately and offset the rotary positional encodings by start_pos. The optimization trades memory for throughput: the cached K/V grow with model depth and generated sequence length.
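To make the mechanics concrete, below is a minimal PyTorch sketch of the pattern described above: one dictionary per layer holding 'key' and 'value' tensors, filled during prefill and appended to on each decode step, with start_pos tracking the absolute position for rotary encodings. The class and argument names (CachedSelfAttention, cache, start_pos) are illustrative assumptions, not nanoVLM's actual code, and the RoPE application itself is elided.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CachedSelfAttention(nn.Module):
    """Illustrative self-attention block with a per-layer KV cache.

    A sketch of the technique, not nanoVLM's actual module.
    """

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor, cache: dict, start_pos: int = 0):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

        # Rotary positional encodings would be applied to q and k here,
        # using absolute positions start_pos .. start_pos + T - 1 so that
        # cached keys keep the positions they were computed with (omitted).

        if cache.get('key') is not None:
            # Decode step: append the new token's K/V to the cached tensors.
            k = torch.cat([cache['key'], k], dim=2)
            v = torch.cat([cache['value'], v], dim=2)
        cache['key'], cache['value'] = k, v  # prefill stores the full prompt

        # During decode, the single query attends to every cached key, so a
        # causal mask is only needed for the multi-token prefill pass.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=(T > 1))
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.proj(out)


# Prefill the prompt once, then decode one token at a time.
layer = CachedSelfAttention(dim=64, n_heads=4)
cache = {'key': None, 'value': None}
prompt = torch.randn(1, 10, 64)
_ = layer(prompt, cache, start_pos=0)    # prefill: caches 10 positions
new_tok = torch.randn(1, 1, 64)
_ = layer(new_tok, cache, start_pos=10)  # decode: appends one position
```

In a full model, each transformer layer holds one such dictionary, which is why the memory cost scales with model depth as well as with the generated sequence length.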
Affected Systems
- nanoVLM
- Date: not specified
- Change type: capability
- Severity: info