Hugging Face Transformers quantization: bitsandbytes vs auto-gptq for inference and adapters
AI Impact Summary
The Transformers overview contrasts bitsandbytes and auto-gptq as natively supported quantization options. bitsandbytes enables on-the-fly 8-bit/4-bit quantization, without a calibration dataset, for any model containing torch.nn.Linear modules, and supports cross-modality usage (e.g., Whisper, ViT, Blip2). Auto-GPTQ provides faster text generation and 2- to 4-bit quantization, but requires a calibration dataset and longer quantization times, with serialization constraints that depend on the bit-width and the model source (e.g., TheBloke GPTQ models on the Hugging Face Hub). A key deployment trade-off is adapter handling: with bitsandbytes, adapters merged on top of a quantized base model incur no inference loss, while this merge is not supported for GPTQ. The benchmarks referenced cover meta-llama/Llama-2-13b-hf and meta-llama/Llama-2-7b-hf alongside TheBloke's GPTQ-quantized counterparts, illustrating performance differences across hardware and batch sizes.
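The loading-time difference between the two backends can be made concrete with a short sketch. The snippet below is illustrative rather than taken from the overview: it assumes transformers with bitsandbytes and auto-gptq (via optimum) installed, a CUDA GPU, and access to the gated meta-llama/Llama-2-7b-hf checkpoint; TheBloke/Llama-2-7B-GPTQ stands in as an example of a pre-quantized GPTQ model.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# bitsandbytes: quantize the full-precision checkpoint on the fly at load time.
# No calibration dataset is needed; any torch.nn.Linear-based model qualifies.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)
bnb_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# auto-gptq: load weights that were already quantized with a calibration dataset,
# here one of TheBloke's pre-quantized GPTQ checkpoints; transformers picks up the
# GPTQ quantization config stored in the repo.
gptq_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)
```

With bitsandbytes the quantization cost is paid at every load, whereas the GPTQ path shifts that cost (and the calibration requirement) to a one-time quantization step, which is what enables its faster text generation at inference.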
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info