BLOOM 176B Training Pipeline: 3D Parallelism using Megatron-DeepSpeed
AI Impact Summary
The article describes training the 176B-parameter BLOOM model on the Jean Zay supercomputer with 384 NVIDIA A100 80GB GPUs, using a Megatron-DeepSpeed stack that combines ZeRO sharding and pipeline parallelism from DeepSpeed with tensor parallelism from Megatron-LM. It details the hardware footprint (NVLink, Omni-Path, GPFS, a 2.3TB full checkpoint including optimizer states, of which 329GB are the bf16 weights) and a roughly 3.5-month training run, illustrating the scale and engineering complexity behind state-of-the-art multilingual models. For teams, reproducing or extending BLOOM requires access to HPC-grade interconnects, large GPU memory pools, and a forked Megatron-DeepSpeed workflow; generic cloud training without comparable hardware and collaboration would be prohibitively expensive and slow. On the business side, only organizations with sustained compute budgets and infrastructure partnerships can expect to replicate such scale, which limits practical access to similar capabilities.
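To ground these figures, the sketch below (plain Python, illustrative only) shows how 3D-parallel degrees compose over 384 GPUs and roughly where the quoted checkpoint sizes come from. The TP=4 / PP=12 degrees and the ~14 bytes-per-parameter checkpoint estimate are assumptions drawn from publicly documented BLOOM configurations and standard mixed-precision Adam bookkeeping, not from this summary.

```python
# Illustrative back-of-the-envelope arithmetic; values marked "assumed" are not from this summary.

N_GPUS = 384                # A100 80GB GPUs used on Jean Zay
PARAMS = 176e9              # ~176B parameters

TENSOR_PARALLEL = 4         # assumed TP degree (intra-node, over NVLink)
PIPELINE_PARALLEL = 12      # assumed PP degree (inter-node, over Omni-Path)
data_parallel = N_GPUS // (TENSOR_PARALLEL * PIPELINE_PARALLEL)
assert TENSOR_PARALLEL * PIPELINE_PARALLEL * data_parallel == N_GPUS
print(f"data-parallel replicas: {data_parallel}")        # -> 8

# bf16 weights alone: ~2 bytes per parameter
bf16_weights_gib = PARAMS * 2 / 2**30
print(f"bf16 weights: ~{bf16_weights_gib:.0f} GiB")      # ~328 GiB, consistent with the ~329GB figure

# Full checkpoint: bf16 weights + fp32 master weights + two fp32 Adam moments (~14 bytes/param, assumed)
full_ckpt_tib = PARAMS * 14 / 2**40
print(f"full checkpoint: ~{full_ckpt_tib:.1f} TiB")      # ~2.2 TiB, in line with the 2.3TB figure
```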
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info