Quanto: PyTorch quantization backend for Optimum enables int2/int4/int8 and float8 quantization in Transformers workflows
AI Impact Summary
Quanto introduces a PyTorch quantization backend for Optimum, enabling end-to-end quantization workflows in eager mode on CPU, CUDA, and MPS devices. It supports int2/int4/int8 weights and int8/float8 activations, with optional calibration and quantization-aware training, and it integrates with transformers via QuantoConfig and from_pretrained. Quantized models can be serialized with safetensors alongside a quantization_map and reloaded via requantize, streamlining deployment of models such as openai/whisper-large-v3 and meta-llama/Meta-Llama-3.1-8B. The design emphasizes simple primitives that work across modalities; device-specific kernels accelerate quantized matmuls on CUDA, while eager-mode execution keeps the workflow portable to edge and on-device targets.
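The transformers integration mentioned above takes only a few lines. The sketch below assumes optimum-quanto is installed and uses the Llama checkpoint from the summary as an illustrative (gated) model id:

```python
from transformers import AutoModelForCausalLM, QuantoConfig

model_id = "meta-llama/Meta-Llama-3.1-8B"  # illustrative; this checkpoint is gated
quantization_config = QuantoConfig(weights="int8")  # "int2", "int4", "float8" also accepted

# weights are quantized on the fly while the checkpoint loads
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="cuda:0",
)
```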
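For the lower-level workflow with activation calibration, optimum-quanto exposes quantize, Calibration, and freeze directly. A minimal sketch on a toy module (the model and sample data are illustrative stand-ins):

```python
import torch
from optimum.quanto import Calibration, freeze, qint8, quantize

# toy module standing in for any eager-mode PyTorch model
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
)

# replace Linear layers with quantized equivalents (int8 weights and activations)
quantize(model, weights=qint8, activations=qint8)

# record activation ranges by streaming sample batches through the model
with torch.no_grad(), Calibration():
    for _ in range(8):
        model(torch.randn(32, 64))

# freeze swaps the float weights for their quantized counterparts
freeze(model)
```

For quantization-aware training, the quantized model can be fine-tuned as usual before freeze is called.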
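Serialization follows the safetensors-plus-quantization_map pattern described in the summary. Continuing from the toy model above, with illustrative file names:

```python
import json
import torch
from optimum.quanto import quantization_map, requantize
from safetensors.torch import load_file, save_file

# save: quantized tensors in a safetensors file, layer-to-qtype mapping as JSON
save_file(model.state_dict(), "model.safetensors")
with open("quantization_map.json", "w") as f:
    json.dump(quantization_map(model), f)

# reload: rebuild the same architecture without weights, then requantize it
state_dict = load_file("model.safetensors")
with open("quantization_map.json") as f:
    qmap = json.load(f)
with torch.device("meta"):
    new_model = torch.nn.Sequential(
        torch.nn.Linear(64, 128),
        torch.nn.ReLU(),
        torch.nn.Linear(128, 10),
    )
requantize(new_model, state_dict, qmap, device=torch.device("cpu"))
```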
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info