Hugging Face training efficiency through packing with Flash Attention 2 and DataCollatorWithFlattening
AI Impact Summary
Hugging Face now supports packing training examples without padding when using Flash Attention 2, enabled by DataCollatorWithFlattening and boundary-aware packing. The approach concatenates a batch's sequences into a single tensor while preserving per-example boundaries, using cu_seqlens so attention never crosses example boundaries and convergence is unaffected. In practice this yields up to 2x training throughput and up to 20% lower peak memory on data with heterogeneous sequence lengths (e.g., FLAN), with more modest gains on more uniform datasets such as OrcaMath; the benefit varies with the model and the sequence-length distribution. To adopt it, use the Transformers Trainer with attn_implementation='flash_attention_2' and DataCollatorWithFlattening, or TRL's SFTTrainer with padding_free=True on DataCollatorForCompletionOnlyLM. Supported models include Llama 2/3, Mistral, Mixtral, Granite, DBRX, Falcon, Gemma, OLMo, Phi 1/2/3, Qwen 2, Qwen 2 MoE, StableLM, and StarCoder 2.
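The sketch below illustrates the Transformers Trainer adoption path described above. It is a minimal example, not taken from the original announcement: the model checkpoint, training arguments, and tokenized_train_dataset are placeholders you would substitute with your own.

```python
# Minimal sketch: padding-free packing with Flash Attention 2 and
# DataCollatorWithFlattening. Checkpoint name and dataset are assumptions.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorWithFlattening,
    Trainer,
    TrainingArguments,
)

checkpoint = "meta-llama/Llama-2-7b-hf"  # any supported model listed above

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    attn_implementation="flash_attention_2",  # required for padding-free packing
    torch_dtype=torch.bfloat16,               # Flash Attention 2 needs fp16/bf16
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# The collator concatenates each batch's examples into one flattened tensor and
# emits per-example position information, so Flash Attention 2 can keep the
# example boundaries and never attend across them.
collator = DataCollatorWithFlattening()

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=4,
        bf16=True,
    ),
    train_dataset=tokenized_train_dataset,  # assumed: pre-tokenized, variable-length examples
    data_collator=collator,
)
trainer.train()
```

Per the summary above, the TRL path is analogous: pass padding_free=True to DataCollatorForCompletionOnlyLM and hand that collator to SFTTrainer, with the model likewise loaded using attn_implementation='flash_attention_2'.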
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info