Fine-tuning Llama-2-70B with PyTorch FSDP on a 2-node A100 cluster
AI Impact Summary
The post outlines a practical workflow for fine-tuning Llama-2-70B across a 2-node, 8-GPU-per-node A100 cluster using PyTorch FSDP together with Hugging Face Transformers, Accelerate, and TRL. It emphasizes managing memory and communication under FSDP: loading the full weights on a single rank while the other ranks initialize on the meta device, using SHARDED_STATE_DICT for intermediate checkpoints, and switching to FULL_STATE_DICT only for the final save; it also advocates bf16 precision, Flash Attention 2, and gradient checkpointing to cut VRAM usage and training time. The guidance highlights potential failure modes such as CPU RAM exhaustion and NCCL timeouts when broadcasting large weights, underscoring the need for a correct configuration (fsdp_config.yaml, accelerate config) and SLURM integration to achieve scalable, cost-effective fine-tuning.
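For illustration, here is a minimal sketch of the RAM-efficient load the summary describes: full bf16 weights are materialized only on rank 0, every other rank builds the model on the meta device, and FSDP broadcasts and shards the weights unit by unit as it wraps the model. The checkpoint id, wrapping policy, and exact Transformers kwargs are assumptions for this sketch; in the post's workflow this path is normally driven through the Accelerate/TRL configuration rather than hand-written FSDP code.

```python
import functools
import os

import torch
import torch.distributed as dist
from accelerate import init_empty_weights
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoConfig, AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Assumes launch via torchrun / accelerate launch / srun, which set the env vars.
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
rank = dist.get_rank()

model_id = "meta-llama/Llama-2-70b-hf"  # assumed checkpoint id

if rank == 0:
    # Only rank 0 materializes the full bf16 weights (~140 GB) in CPU RAM.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
else:
    # All other ranks build an empty shell on the meta device, so a node's
    # CPU RAM is not exhausted by eight redundant copies of the checkpoint.
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(
            AutoConfig.from_pretrained(model_id),
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
        )

# FSDP shards each decoder layer across all 16 GPUs. sync_module_states=True
# broadcasts rank 0's weights to the meta-initialized ranks unit by unit, and
# param_init_fn first gives those ranks real (empty) storage to receive them.
model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
    ),
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16
    ),
    sync_module_states=True,
    param_init_fn=(
        (lambda m: m.to_empty(device=torch.device("cuda"), recurse=False))
        if rank != 0
        else None
    ),
)
```

The same split applies to checkpointing. The sketch below shows SHARDED_STATE_DICT for frequent intermediate saves, where each rank writes only its own shard, and a single FULL_STATE_DICT gather, CPU-offloaded and kept on rank 0 only, for the final export; the helper names, paths, and the torch.distributed.checkpoint calls (PyTorch 2.2+ API) are illustrative assumptions.

```python
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)


def save_intermediate(model: FSDP, step: int, ckpt_dir: str = "checkpoints") -> None:
    # Each rank serializes only its own shard: no all-gather, no rank-0 RAM spike.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        dcp.save(
            {"model": model.state_dict()},
            storage_writer=dcp.FileSystemWriter(f"{ckpt_dir}/step_{step}"),
        )


def save_final(model: FSDP, out_path: str = "final_model.pt") -> None:
    # One-time full gather for the final artifact: weights are offloaded to CPU
    # as they are collected and kept on rank 0 only, to avoid OOM elsewhere.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        full_sd = model.state_dict()
    if dist.get_rank() == 0:
        torch.save(full_sd, out_path)
```

In the Accelerate-driven setup the post describes, this choice is typically expressed through the state-dict setting in fsdp_config.yaml rather than by calling these APIs by hand.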
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium