KV Caching in nanoVLM yields 38% generation speedup
AI Impact Summary
nanoVLM introduces a per-layer KV cache in its self-attention blocks to reuse previously computed keys and values across generation steps. After an initial prefill pass over the prompt, each step computes only the new token's K/V and appends them to the cache, turning generation into an incremental update and yielding a 38% generation speedup. Each layer's cache is a dictionary holding 'key' and 'value' tensors, and the implementation must treat the prefill and decode phases separately and offset the rotary positional encodings by start_pos. The optimization trades memory for throughput: the cached K/V grow with model depth and generated sequence length.
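To make the mechanics concrete, below is a minimal PyTorch sketch of the pattern described above: one dictionary per layer holding 'key' and 'value' tensors, filled during prefill and appended to on each decode step, with start_pos tracking the absolute position for rotary encodings. The class and argument names (CachedSelfAttention, cache, start_pos) are illustrative assumptions, not nanoVLM's actual code, and the RoPE application itself is elided.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CachedSelfAttention(nn.Module):
    """Illustrative self-attention block with a per-layer KV cache.

    A sketch of the technique, not nanoVLM's actual module.
    """

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor, cache: dict, start_pos: int = 0):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

        # Rotary positional encodings would be applied to q and k here,
        # using absolute positions start_pos .. start_pos + T - 1 so that
        # cached keys keep the positions they were computed with (omitted).

        if cache.get('key') is not None:
            # Decode step: append the new token's K/V to the cached tensors.
            k = torch.cat([cache['key'], k], dim=2)
            v = torch.cat([cache['value'], v], dim=2)
        cache['key'], cache['value'] = k, v  # prefill stores the full prompt

        # During decode, the single query attends to every cached key, so a
        # causal mask is only needed for the multi-token prefill pass.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=(T > 1))
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.proj(out)


# Prefill the prompt once, then decode one token at a time.
layer = CachedSelfAttention(dim=64, n_heads=4)
cache = {'key': None, 'value': None}
prompt = torch.randn(1, 10, 64)
_ = layer(prompt, cache, start_pos=0)    # prefill: caches 10 positions
new_tok = torch.randn(1, 1, 64)
_ = layer(new_tok, cache, start_pos=10)  # decode: appends one position
```

In a full model, each transformer layer holds one such dictionary, which is why the memory cost scales with model depth as well as with the generated sequence length.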
Affected Systems
- nanoVLM
- Date: not specified
- Change type: capability
- Severity: info