Hugging Face Accelerate enables running very large models on limited hardware using PyTorch's meta device
AI Impact Summary
Hugging Face Accelerate leverages PyTorch's meta device and an empty-weight workflow to load and run inference for very large language models that don't fit in RAM or on a single GPU. It instantiates the model with empty weights, allocates weight shards across devices using infer_auto_device_map, and offloads excess weights to CPU or disk, enabling models such as OPT-6.7B, OPT-13B, and BLOOM to run on consumer hardware and in Colab. This allows rapid prototyping and inference on smaller budgets, but it introduces complexity around device maps, partial offloads, and submodule compatibility; teams should plan for longer initialization and I/O overhead and ensure correct use of init_empty_weights and no_split_module_classes.
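The mechanism underlying the empty-weight workflow can be illustrated with plain PyTorch: tensors created on the meta device carry shape and dtype metadata but allocate no storage. A minimal sketch, assuming a recent PyTorch version (the layer and its dimensions are illustrative, not from the source):

```python
import torch
import torch.nn as nn

# Construct a large layer on the meta device: its parameters record
# shape and dtype, but no weight storage is actually allocated.
with torch.device("meta"):
    layer = nn.Linear(16384, 16384)

assert layer.weight.is_meta       # no data behind the tensor
print(layer.weight.shape)         # shape metadata is still available

# Accelerate builds on this idea: init_empty_weights() creates the whole
# model on the meta device, and the checkpoint shards are later
# materialized on the devices chosen by the device map.
```

Because nothing is allocated, even a model far larger than available RAM can be "built" this way; real memory is only consumed when the checkpoint shards are loaded onto their assigned devices.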
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info