Hugging Face Transformers quantization: bitsandbytes vs auto-gptq for inference and adapters
AI Impact Summary
The Transformers overview contrasts bitsandbytes and auto-gptq as natively supported quantization options. bitsandbytes enables on-the-fly 8-bit/4-bit quantization, without a calibration dataset, for any model containing torch.nn.Linear modules, and supports cross-modality usage (e.g., Whisper, ViT, Blip2). Auto-GPTQ provides faster text generation and 2- to 4-bit quantization, but requires a calibration dataset and longer quantization times, with serialization constraints that depend on the bit-width and the model source (e.g., TheBloke GPTQ models on the Hugging Face Hub). A key deployment trade-off is adapter handling: with bitsandbytes, adapters merged on top of a quantized base model incur no inference loss, while this merge is not supported for GPTQ. The benchmarks referenced cover meta-llama/Llama-2-13b-hf and meta-llama/Llama-2-7b-hf alongside TheBloke's GPTQ-quantized counterparts, illustrating performance differences across hardware and batch sizes.
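The loading-time difference between the two backends can be made concrete with a short sketch. The snippet below is illustrative rather than taken from the overview: it assumes transformers with bitsandbytes and auto-gptq (via optimum) installed, a CUDA GPU, and access to the gated meta-llama/Llama-2-7b-hf checkpoint; TheBloke/Llama-2-7B-GPTQ stands in as an example of a pre-quantized GPTQ model.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# bitsandbytes: quantize the full-precision checkpoint on the fly at load time.
# No calibration dataset is needed; any torch.nn.Linear-based model qualifies.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)
bnb_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# auto-gptq: load weights that were already quantized with a calibration dataset,
# here one of TheBloke's pre-quantized GPTQ checkpoints; transformers picks up the
# GPTQ quantization config stored in the repo.
gptq_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)
```

With bitsandbytes the quantization cost is paid at every load, whereas the GPTQ path shifts that cost (and the calibration requirement) to a one-time quantization step, which is what enables its faster text generation at inference.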
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info