BLOOM 176B Training Pipeline: 3D Parallelism using Megatron-DeepSpeed
AI Impact Summary
The article describes training the 176B-parameter BLOOM model on the Jean Zay supercomputer with 384 NVIDIA A100 80GB GPUs, using a Megatron-DeepSpeed stack that combines ZeRO sharding and pipeline parallelism from DeepSpeed with tensor parallelism from Megatron-LM. It details the hardware footprint (NVLink, Omni-Path, GPFS, a 2.3TB full checkpoint including optimizer states, of which 329GB are the bf16 weights) and a roughly 3.5-month training run, illustrating the scale and engineering complexity behind state-of-the-art multilingual models. For teams, reproducing or extending BLOOM requires access to HPC-grade interconnects, large GPU memory pools, and a forked Megatron-DeepSpeed workflow; generic cloud training without comparable hardware and collaboration would be prohibitively expensive and slow. On the business side, only organizations with sustained compute budgets and infrastructure partnerships can expect to replicate such scale, which limits practical access to similar capabilities.
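To ground these figures, the sketch below (plain Python, illustrative only) shows how 3D-parallel degrees compose over 384 GPUs and roughly where the quoted checkpoint sizes come from. The TP=4 / PP=12 degrees and the ~14 bytes-per-parameter checkpoint estimate are assumptions drawn from publicly documented BLOOM configurations and standard mixed-precision Adam bookkeeping, not from this summary.

```python
# Illustrative back-of-the-envelope arithmetic; values marked "assumed" are not from this summary.

N_GPUS = 384                # A100 80GB GPUs used on Jean Zay
PARAMS = 176e9              # ~176B parameters

TENSOR_PARALLEL = 4         # assumed TP degree (intra-node, over NVLink)
PIPELINE_PARALLEL = 12      # assumed PP degree (inter-node, over Omni-Path)
data_parallel = N_GPUS // (TENSOR_PARALLEL * PIPELINE_PARALLEL)
assert TENSOR_PARALLEL * PIPELINE_PARALLEL * data_parallel == N_GPUS
print(f"data-parallel replicas: {data_parallel}")        # -> 8

# bf16 weights alone: ~2 bytes per parameter
bf16_weights_gib = PARAMS * 2 / 2**30
print(f"bf16 weights: ~{bf16_weights_gib:.0f} GiB")      # ~328 GiB, consistent with the ~329GB figure

# Full checkpoint: bf16 weights + fp32 master weights + two fp32 Adam moments (~14 bytes/param, assumed)
full_ckpt_tib = PARAMS * 14 / 2**40
print(f"full checkpoint: ~{full_ckpt_tib:.1f} TiB")      # ~2.2 TiB, in line with the 2.3TB figure
```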
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info