Hugging Face Transformers adds padding-free packing with Flash Attention 2 via DataCollatorWithFlattening to boost training throughput
AI Impact Summary
Hugging Face Transformers now supports padding-free packing of instruction-tuning examples with Flash Attention 2, enabled by the new DataCollatorWithFlattening collator and cu_seqlens handling that preserves example boundaries. In benchmarks across models such as Llama 2/3, Mistral, Mixtral, and Granite, throughput improvements reach up to 2x on data with widely varying lengths (e.g., FLAN), with peak memory reductions around 20%, while longer, more uniform sequences see more modest gains. Convergence is preserved because minibatch contents and optimization steps remain unchanged, and TRL users can get the same benefit by enabling padding_free with DataCollatorForCompletionOnlyLM. Adoption requires integrating the new data collator and ensuring position_ids are exposed where needed, and the gains are most pronounced when training data exhibits wide length variance.
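A minimal sketch of how the collator slots into a standard Trainer setup; the checkpoint name, dataset, and training arguments below are placeholders chosen for illustration, not taken from the summary. The key points are loading the model with attn_implementation="flash_attention_2" and passing DataCollatorWithFlattening as the data collator, so examples in a minibatch are concatenated without pad tokens and boundaries are conveyed through position_ids.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorWithFlattening,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Flash Attention 2 is required so the flattened batch can be attended per
# example (boundaries passed via position_ids / cu_seqlens) instead of
# relying on padding and an attention mask.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Toy pre-tokenized dataset (assumption); in practice this would be your
# tokenized instruction-tuning data.
texts = [
    "Example instruction and response.",
    "Another, somewhat longer, training example to show length variance.",
]
train_dataset = [tokenizer(t) for t in texts]

# Concatenates each minibatch into one sequence, emitting position_ids that
# mark where each example starts, so no pad tokens are needed.
data_collator = DataCollatorWithFlattening()

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=2,
        bf16=True,
    ),
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
```

For TRL, the summary indicates the same effect is available by constructing DataCollatorForCompletionOnlyLM with padding_free enabled rather than swapping in a different collator.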
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info