Hugging Face Transformers integrates AutoGPTQ-based GPTQ quantization for 8/4/3/2-bit LLMs
AI Impact Summary
Transformers now integrates the AutoGPTQ library to perform GPTQ quantization of models to 8/4/3/2-bit precision, with negligible accuracy loss at 4-bit in many cases. This yields memory reductions of roughly 4x for int4 and can keep inference latency close to FP16 at small batch sizes, enabling larger LLMs to be deployed on consumer GPUs. The integration supports NVIDIA GPUs and AMD GPUs via ROCm, requires a calibration dataset for quantization, and offers optional Exllama and fused-attention kernels; it covers model families such as Llama-2 and OPT as well as pre-quantized checkpoints such as those published by TheBloke, expanding the set of quantizable models through Transformers and Optimum. Plan for calibration data, kernel compatibility, and model-level accuracy checks when migrating from FP16 to GPTQ-quantized weights.
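A minimal sketch of the workflow described above, assuming transformers, optimum, and auto-gptq are installed and a CUDA or ROCm GPU is available; the model id, dataset choice, and output path are illustrative, and exact parameter names may vary across library versions:

```python
# Quantize an FP16 checkpoint to 4-bit GPTQ using a calibration dataset,
# then save the quantized weights for later reuse.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small model chosen purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ requires calibration data; "c4" is one of the built-in dataset options.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs during loading and needs a GPU.
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# The quantized checkpoint can be saved and reloaded later without
# re-running calibration (pre-quantized Hub checkpoints, e.g. TheBloke's,
# load the same way via from_pretrained).
quantized_model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")
```

Because calibration is a one-time cost, teams typically quantize once, publish the GPTQ checkpoint, and serve inference from the quantized weights.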
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info