🤗 Transformers adds native quantization: bitsandbytes and auto-gptq for PyTorch models
AI Impact Summary
🤗 Transformers now ships native quantization paths for PyTorch models via bitsandbytes and auto-gptq, letting large models run on smaller GPUs and enabling adapter fine-tuning on quantized bases. Bitsandbytes offers zero-shot 8-bit/4-bit quantization with out-of-the-box cross-modality support and automatic device placement, though 4-bit serialization is not yet supported. Auto-GPTQ emphasizes text-generation speed, supports quantization down to 2 bits, and serializes easily, but it requires a calibration dataset, can take hours to quantize large models, is currently limited to language models, and does not support merging adapters into the quantized base. The post also references benchmarks on Llama-2-7b-hf and Llama-2-13b-hf, and notes AMD compatibility and coverage of models beyond text (Whisper, ViT, Blip2) that affect deployment options.
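As a rough illustration of the two paths the post describes, the sketch below loads a base model in 4-bit with bitsandbytes (no calibration data) and quantizes the same model with GPTQ (calibration dataset required). It assumes the `BitsAndBytesConfig`/`GPTQConfig` API introduced in the Transformers release this post covers; the `meta-llama/Llama-2-7b-hf` checkpoint ID and the specific config fields are illustrative choices, not code taken from the post.

```python
# Minimal sketch of both quantization paths (assumed Transformers API;
# model ID and config values are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"  # assumed Hub ID for the benchmarked model

# Path 1: bitsandbytes -- zero-shot 4-bit load, no calibration needed.
# device_map="auto" gives the automatic device placement mentioned above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
bnb_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Path 2: auto-gptq -- requires a calibration dataset and can take hours
# on large models, but the result serializes with save_pretrained().
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
gptq_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
gptq_model.save_pretrained("llama-2-7b-gptq-4bit")  # GPTQ checkpoints serialize easily
```

The trade-off between the two paths mirrors the summary above: the bitsandbytes load is immediate but (at the time of the post) cannot serialize 4-bit weights, while the GPTQ pass is slow up front but produces a reusable quantized checkpoint.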
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info