Memory-efficient Diffusion Transformers with Quanto and Diffusers — FP8 quantization for PixArt-Sigma and Stable Diffusion 3
AI Impact Summary
The article demonstrates that Transformer-based diffusion backbones (0.6B–8B parameters) incur high memory usage, especially when combined with multiple text encoders, as in Stable Diffusion 3. By applying Quanto quantization within Diffusers, the authors obtain meaningful memory savings with qfloat8 (FP8) and qint8 weights at minimal quality loss, and the largest additional savings come from also quantizing the text encoders, which matters most when a pipeline ships several of them. This makes larger diffusion models feasible on consumer GPUs (the article's FP16 baselines are measured on an H100) and shortens iteration time for experimentation. The trade-offs are added latency and possible quality degradation under aggressive quantization, and migrating a pipeline requires explicit steps to quantize and then freeze the affected components.
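A minimal sketch of the quantize-and-freeze flow the summary refers to, using the optimum-quanto `quantize`/`freeze` API together with a Diffusers pipeline. The PixArt-Sigma checkpoint ID and the choice of qfloat8 weights are illustrative assumptions, not a prescription from the article.

```python
import torch
from diffusers import PixArtSigmaPipeline
from optimum.quanto import freeze, qfloat8, quantize

# Load the pipeline in FP16; the checkpoint ID below is an assumed example.
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    torch_dtype=torch.float16,
).to("cuda")

# Quantize the diffusion transformer's weights to FP8, then freeze them so
# the quantized weights replace the FP16 originals and stay fixed at inference.
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)

# Optionally quantize the text encoder as well -- per the summary, this is
# where the biggest savings come from when a pipeline bundles several encoders.
quantize(pipe.text_encoder, weights=qfloat8)
freeze(pipe.text_encoder)

image = pipe("a small cactus with a happy face in the Sahara desert").images[0]
```

In optimum-quanto, `quantize` swaps in quantized module wrappers while `freeze` converts the float weights into static quantized tensors, which is when the memory savings are actually realized.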
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info