Accelerate Large Model Training with DeepSpeed ZeRO Stage-2 via Hugging Face Accelerate
AI Impact Summary
The document describes using Hugging Face Accelerate to enable DeepSpeed ZeRO optimization with no changes to the training code, specifically demonstrating ZeRO Stage-2 to train a ~900M-parameter DeBERTa-v2-xlarge-mnli model on a single node with two 24GB GPUs. It highlights a dramatic memory and throughput advantage: the per-GPU batch size jumps from 8 (DDP) to 40, yielding roughly a 3.5x reduction in total training time while maintaining performance on MRPC, because ZeRO Stage-2 partitions optimizer states and gradients across GPUs and can optionally offload them. To operationalize this, teams must configure a DeepSpeed config file and run accelerate config, noting precision considerations (bf16) and the potential for NaN losses, with a path to scaling to other models and hardware using the same Accelerate + DeepSpeed workflow.
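The sketch below illustrates what this Accelerate + DeepSpeed ZeRO Stage-2 workflow can look like in code. It is a minimal, hedged example, not the original post's script: it uses Accelerate's DeepSpeedPlugin in place of a config generated by accelerate config or a deepspeed_config.json, and the model ID, batch size, learning rate, and placeholder MRPC-style data are illustrative assumptions.

```python
# Minimal sketch: fine-tuning with Accelerate + DeepSpeed ZeRO Stage-2.
# Launch with: accelerate launch this_script.py (DeepSpeed must be installed).
# Hyperparameters and the placeholder data below are assumptions for illustration.
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from accelerate import Accelerator, DeepSpeedPlugin

# ZeRO Stage-2: partition optimizer states and gradients across GPUs.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=1,
    gradient_clipping=1.0,
    offload_optimizer_device="none",  # "cpu" would additionally offload optimizer states
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)

model_name = "microsoft/deberta-v2-xlarge-mnli"  # ~900M-parameter model referenced above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, ignore_mismatched_sizes=True  # re-head for 2-class MRPC
)
optimizer = AdamW(model.parameters(), lr=2e-5)

# Placeholder sentence pairs; in practice this would be the tokenized MRPC train split.
pairs = [("He said yes.", "He agreed.")] * 64
enc = tokenizer([a for a, _ in pairs], [b for _, b in pairs],
                padding=True, truncation=True, return_tensors="pt")
labels = torch.ones(len(pairs), dtype=torch.long)
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], labels)
train_dataloader = DataLoader(dataset, batch_size=40, shuffle=True)  # 40/GPU per the summary

# The training loop is the standard Accelerate loop; backward/step are routed through DeepSpeed.
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
model.train()
for input_ids, attention_mask, batch_labels in train_dataloader:
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
    accelerator.backward(outputs.loss)
    optimizer.step()
    optimizer.zero_grad()
```

The same script runs unchanged under plain DDP or under DeepSpeed; only the launch-time configuration (accelerate config answers or the DeepSpeed config file) changes, which is the "no code changes" point the summary makes.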
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info