Fine-tuning Llama-2-70B with PyTorch FSDP: memory strategies and Accelerate config
AI Impact Summary
The article demonstrates fine-tuning Llama-2-70B with PyTorch FSDP, which shards parameters, gradients, and optimizer state to fit training across 2 nodes of 8x A100-80GB GPUs each. It documents three concrete challenges: (1) loading the full 70B-parameter weights on every rank exhausts CPU RAM; (2) saving intermediate checkpoints with FULL_STATE_DICT triggers NCCL timeouts; and (3) the memory and speed gains come from SHARDED_STATE_DICT checkpointing, loading real weights only on rank 0 (meta device on the other ranks), bf16 mixed precision, Flash Attention V2, and gradient checkpointing. It provides actionable steps: configuring Accelerate with an fsdp_config.yaml, using the TRANSFORMER_BASED_WRAP auto-wrap policy, and switching to FULL_STATE_DICT only for the final save while keeping SHARDED_STATE_DICT for intermediate checkpoints. Together, these show that fine-tuning at this scale is feasible and cost-efficient only with careful runtime configuration and patching.
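The Accelerate configuration described above can be sketched roughly as follows. This is an illustrative fsdp_config.yaml, not the article's exact file: key names follow Accelerate's FSDP config schema but vary by version (e.g. `fsdp_sharding_strategy` may take `1` instead of `FULL_SHARD` in older releases), and `num_processes: 16` is inferred from the 2x8 GPU setup.

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  # Wrap per transformer block, as the summary's TRANSFORMER_BASED_WRAP step describes
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  # Shard parameters, gradients, and optimizer state across all ranks
  fsdp_sharding_strategy: FULL_SHARD
  # Sharded checkpoints for intermediates; switch to FULL_STATE_DICT only for the final save
  fsdp_state_dict_type: SHARDED_STATE_DICT
  # Load real weights on rank 0 only and broadcast, avoiding per-rank CPU RAM blow-up
  fsdp_cpu_ram_efficient_loading: true
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
  fsdp_offload_params: false
mixed_precision: bf16
num_machines: 2
num_processes: 16
machine_rank: 0
main_training_function: main
```

With a config like this, training would be launched per node via something like `accelerate launch --config_file fsdp_config.yaml train.py` (where `train.py` is a placeholder for the training script).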
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium