Hugging Face Transformers integrates AutoGPTQ-based GPTQ quantization for 8/4/3/2-bit LLMs
AI Impact Summary
Transformers now integrates the AutoGPTQ library to perform GPTQ quantization of models to 8/4/3/2-bit precision, with negligible accuracy loss at 4-bit in many cases. This yields memory reductions of roughly 4x for int4 and can keep inference latency close to FP16 at small batch sizes, enabling larger LLMs to be deployed on consumer GPUs. The integration supports NVIDIA GPUs and AMD GPUs via ROCm, requires a calibration dataset for quantization, and offers optional Exllama and fused-attention kernels; it covers model families such as Llama-2 and OPT as well as pre-quantized checkpoints such as those published by TheBloke, expanding the set of quantizable models through Transformers and Optimum. Plan for calibration data, kernel compatibility, and model-level accuracy checks when migrating from FP16 to GPTQ-quantized weights.
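A minimal sketch of the workflow described above, assuming transformers, optimum, and auto-gptq are installed and a CUDA or ROCm GPU is available; the model id, dataset choice, and output path are illustrative, and exact parameter names may vary across library versions:

```python
# Quantize an FP16 checkpoint to 4-bit GPTQ using a calibration dataset,
# then save the quantized weights for later reuse.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small model chosen purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ requires calibration data; "c4" is one of the built-in dataset options.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs during loading and needs a GPU.
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# The quantized checkpoint can be saved and reloaded later without
# re-running calibration (pre-quantized Hub checkpoints, e.g. TheBloke's,
# load the same way via from_pretrained).
quantized_model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")
```

Because calibration is a one-time cost, teams typically quantize once, publish the GPTQ checkpoint, and serve inference from the quantized weights.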
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info