Differential Transformer V2 (DIFF V2) improves inference speed and training stability for production-scale LLMs
AI Impact Summary
Differential Transformer V2 (DIFF V2) advances the DIFF family by increasing the query-head count while keeping the KV-head count fixed and removing per-head RMSNorm, enabling faster decoding without custom attention kernels. Because it runs on FlashAttention-compatible kernels and supports long-sequence optimization via YOCO, its inference efficiency matches that of baseline Transformers. Production-scale pretraining experiments on dense models and a 30A3 MoE across trillions of tokens show lower LM loss and fewer gradient spikes at high learning rates, indicating improved training stability. Adoption should simplify deployment pipelines by eliminating custom kernels, but the speedups and stability still need validation across model sizes, head configurations, and sequence lengths.
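The head reconfiguration described above can be illustrated with a minimal sketch. This assumes the original DIFF formulation (output is the difference of two softmax attention maps, scaled by a learnable λ) carries over to V2, with pairs of query heads sharing each KV head GQA-style and no per-head RMSNorm; `diff_attention_gqa`, `lam`, and the head layout are illustrative, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention_gqa(q, k, v, lam=0.5):
    """Sketch of differential attention with grouped KV heads.

    q: (n_q_heads, seq, d)  -- query heads, consumed in pairs (q1, q2)
    k, v: (n_kv_heads, seq, d)  -- fewer KV heads, shared across query heads
    Each consecutive pair of query heads forms one differential head:
    out = (softmax(q1 k^T) - lam * softmax(q2 k^T)) v, with no per-head norm.
    """
    n_q, seq, d = q.shape
    n_kv = k.shape[0]
    assert n_q % (2 * n_kv) == 0  # whole pairs of query heads per KV head
    group = n_q // n_kv           # query heads sharing each KV head
    out = []
    for h in range(n_kv):
        qs = q[h * group:(h + 1) * group]
        for i in range(0, group, 2):
            a1 = softmax(qs[i] @ k[h].T / np.sqrt(d))
            a2 = softmax(qs[i + 1] @ k[h].T / np.sqrt(d))
            out.append((a1 - lam * a2) @ v[h])  # difference of attention maps
    return np.stack(out)  # (n_q // 2, seq, d): one output per query-head pair
```

Because each branch is an ordinary softmax attention, the two maps can in principle be computed with standard FlashAttention-style kernels and subtracted afterwards, which is what makes the kernel-compatibility claim plausible.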
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info