Hugging Face training efficiency through packing with Flash Attention 2 and DataCollatorWithFlattening
AI Impact Summary
Hugging Face now supports packing training examples without padding when using Flash Attention 2, enabled by DataCollatorWithFlattening and boundary-aware packing. The approach concatenates a batch's sequences into a single tensor while preserving per-example boundaries, using cu_seqlens so attention never crosses example boundaries and convergence is unaffected. In practice this yields up to 2x training throughput and up to 20% lower peak memory on data with heterogeneous sequence lengths (e.g., FLAN), with more modest gains on more uniform datasets such as OrcaMath; the benefit varies with the model and the sequence-length distribution. To adopt it, use the Transformers Trainer with attn_implementation='flash_attention_2' and DataCollatorWithFlattening, or TRL's SFTTrainer with padding_free=True on DataCollatorForCompletionOnlyLM. Supported models include Llama 2/3, Mistral, Mixtral, Granite, DBRX, Falcon, Gemma, OLMo, Phi 1/2/3, Qwen 2, Qwen 2 MoE, StableLM, and StarCoder 2.
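The sketch below illustrates the Transformers Trainer adoption path described above. It is a minimal example, not taken from the original announcement: the model checkpoint, training arguments, and tokenized_train_dataset are placeholders you would substitute with your own.

```python
# Minimal sketch: padding-free packing with Flash Attention 2 and
# DataCollatorWithFlattening. Checkpoint name and dataset are assumptions.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorWithFlattening,
    Trainer,
    TrainingArguments,
)

checkpoint = "meta-llama/Llama-2-7b-hf"  # any supported model listed above

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    attn_implementation="flash_attention_2",  # required for padding-free packing
    torch_dtype=torch.bfloat16,               # Flash Attention 2 needs fp16/bf16
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# The collator concatenates each batch's examples into one flattened tensor and
# emits per-example position information, so Flash Attention 2 can keep the
# example boundaries and never attend across them.
collator = DataCollatorWithFlattening()

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=4,
        bf16=True,
    ),
    train_dataset=tokenized_train_dataset,  # assumed: pre-tokenized, variable-length examples
    data_collator=collator,
)
trainer.train()
```

Per the summary above, the TRL path is analogous: pass padding_free=True to DataCollatorForCompletionOnlyLM and hand that collator to SFTTrainer, with the model likewise loaded using attn_implementation='flash_attention_2'.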
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info