Hugging Face Transformers adds padding-free packing with Flash Attention 2 via DataCollatorWithFlattening to boost training throughput
AI Impact Summary
Hugging Face Transformers now supports padding-free packing of instruction-tuning examples with Flash Attention 2, enabled by the new DataCollatorWithFlattening collator and cu_seqlens handling that preserves example boundaries. In benchmarks across models such as Llama 2/3, Mistral, Mixtral, and Granite, throughput improvements reach up to 2x on data with widely varying lengths (e.g., FLAN), with peak memory reductions around 20%, while longer, more uniform sequences see more modest gains. Convergence is preserved because minibatch contents and optimization steps remain unchanged, and TRL users can get the same benefit by enabling padding_free with DataCollatorForCompletionOnlyLM. Adoption requires integrating the new data collator and ensuring position_ids are exposed where needed, and the gains are most pronounced when training data exhibits wide length variance.
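A minimal sketch of how the collator slots into a standard Trainer setup; the checkpoint name, dataset, and training arguments below are placeholders chosen for illustration, not taken from the summary. The key points are loading the model with attn_implementation="flash_attention_2" and passing DataCollatorWithFlattening as the data collator, so examples in a minibatch are concatenated without pad tokens and boundaries are conveyed through position_ids.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorWithFlattening,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Flash Attention 2 is required so the flattened batch can be attended per
# example (boundaries passed via position_ids / cu_seqlens) instead of
# relying on padding and an attention mask.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Toy pre-tokenized dataset (assumption); in practice this would be your
# tokenized instruction-tuning data.
texts = [
    "Example instruction and response.",
    "Another, somewhat longer, training example to show length variance.",
]
train_dataset = [tokenizer(t) for t in texts]

# Concatenates each minibatch into one sequence, emitting position_ids that
# mark where each example starts, so no pad tokens are needed.
data_collator = DataCollatorWithFlattening()

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=2,
        bf16=True,
    ),
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
```

For TRL, the summary indicates the same effect is available by constructing DataCollatorForCompletionOnlyLM with padding_free enabled rather than swapping in a different collator.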
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info