Quanto: PyTorch quantization backend for Optimum enables int2/int4/int8 and float8 quantization in Transformers workflows
AI Impact Summary
Quanto introduces a PyTorch quantization backend for Optimum, enabling end-to-end quantization workflows in eager mode on CPU, CUDA, and MPS devices. It supports int2/int4/int8 weights and int8/float8 activations, with optional calibration and quantization-aware training, and it integrates with transformers via QuantoConfig and from_pretrained. Quantized models can be serialized with safetensors alongside a quantization_map and reloaded via requantize, streamlining deployment of models such as openai/whisper-large-v3 and meta-llama/Meta-Llama-3.1-8B. The design emphasizes simple primitives that work across modalities; device-specific kernels accelerate quantized matmuls on CUDA, while eager-mode execution keeps the workflow portable to edge and on-device targets.
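The transformers integration mentioned above takes only a few lines. The sketch below assumes optimum-quanto is installed and uses the Llama checkpoint from the summary as an illustrative (gated) model id:

```python
from transformers import AutoModelForCausalLM, QuantoConfig

model_id = "meta-llama/Meta-Llama-3.1-8B"  # illustrative; this checkpoint is gated
quantization_config = QuantoConfig(weights="int8")  # "int2", "int4", "float8" also accepted

# weights are quantized on the fly while the checkpoint loads
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="cuda:0",
)
```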
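For the lower-level workflow with activation calibration, optimum-quanto exposes quantize, Calibration, and freeze directly. A minimal sketch on a toy module (the model and sample data are illustrative stand-ins):

```python
import torch
from optimum.quanto import Calibration, freeze, qint8, quantize

# toy module standing in for any eager-mode PyTorch model
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
)

# replace Linear layers with quantized equivalents (int8 weights and activations)
quantize(model, weights=qint8, activations=qint8)

# record activation ranges by streaming sample batches through the model
with torch.no_grad(), Calibration():
    for _ in range(8):
        model(torch.randn(32, 64))

# freeze swaps the float weights for their quantized counterparts
freeze(model)
```

For quantization-aware training, the quantized model can be fine-tuned as usual before freeze is called.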
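Serialization follows the safetensors-plus-quantization_map pattern described in the summary. Continuing from the toy model above, with illustrative file names:

```python
import json
import torch
from optimum.quanto import quantization_map, requantize
from safetensors.torch import load_file, save_file

# save: quantized tensors in a safetensors file, layer-to-qtype mapping as JSON
save_file(model.state_dict(), "model.safetensors")
with open("quantization_map.json", "w") as f:
    json.dump(quantization_map(model), f)

# reload: rebuild the same architecture without weights, then requantize it
state_dict = load_file("model.safetensors")
with open("quantization_map.json") as f:
    qmap = json.load(f)
with torch.device("meta"):
    new_model = torch.nn.Sequential(
        torch.nn.Linear(64, 128),
        torch.nn.ReLU(),
        torch.nn.Linear(128, 10),
    )
requantize(new_model, state_dict, qmap, device=torch.device("cpu"))
```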
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info