🤗 Transformers adds native quantization: bitsandbytes and auto-gptq for PyTorch models
AI Impact Summary
🤗 Transformers now ships native quantization paths for PyTorch models via bitsandbytes and auto-gptq, letting large models run on smaller GPUs and enabling adapter fine-tuning on quantized bases. Bitsandbytes offers zero-shot 8-bit/4-bit quantization with out-of-the-box cross-modality support and automatic device placement, though 4-bit serialization is not yet supported. Auto-GPTQ emphasizes text-generation speed, supports quantization down to 2 bits, and serializes easily, but it requires a calibration dataset, can take hours to quantize large models, is currently limited to language models, and does not support merging adapters into the quantized base. The post also references benchmarks on Llama-2-7b-hf and Llama-2-13b-hf, and notes AMD compatibility and coverage of models beyond text (Whisper, ViT, Blip2) that affect deployment options.
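As a rough illustration of the two paths the post describes, the sketch below loads a base model in 4-bit with bitsandbytes (no calibration data) and quantizes the same model with GPTQ (calibration dataset required). It assumes the `BitsAndBytesConfig`/`GPTQConfig` API introduced in the Transformers release this post covers; the `meta-llama/Llama-2-7b-hf` checkpoint ID and the specific config fields are illustrative choices, not code taken from the post.

```python
# Minimal sketch of both quantization paths (assumed Transformers API;
# model ID and config values are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"  # assumed Hub ID for the benchmarked model

# Path 1: bitsandbytes -- zero-shot 4-bit load, no calibration needed.
# device_map="auto" gives the automatic device placement mentioned above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
bnb_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Path 2: auto-gptq -- requires a calibration dataset and can take hours
# on large models, but the result serializes with save_pretrained().
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
gptq_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
gptq_model.save_pretrained("llama-2-7b-gptq-4bit")  # GPTQ checkpoints serialize easily
```

The trade-off between the two paths mirrors the summary above: the bitsandbytes load is immediate but (at the time of the post) cannot serialize 4-bit weights, while the GPTQ pass is slow up front but produces a reusable quantized checkpoint.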
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info