Memory-efficient Diffusion Transformers with Quanto in Diffusers — FP8/INT8 for PixArt-Sigma and SD3
AI Impact Summary
The post demonstrates memory optimization for Transformer-based diffusion backbones by applying Quanto quantization through the Diffusers library. Quantizing the diffusion transformer to FP8 (and selectively quantizing the text encoders) can cut peak GPU memory by more than half in representative SD3 configurations (e.g., from ~11.5 GB to ~5.3 GB in a single-text-encoder setup), letting larger models such as PixArt-Sigma and Stable Diffusion 3 run on consumer GPUs. It also flags practical caveats: some text-encoder quantization combinations (notably the second text encoder in SD3) may not work well, and FP8/INT8 configurations can increase latency; excluding the final projection layer (proj_out) from quantization and applying quantization-aware training can mitigate quality loss. Overall, this offers a viable path for deploying memory-intensive diffusion transformers in production, provided teams validate latency, output quality, and their text-encoder quantization strategy on their specific pipelines.
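As a rough illustration of the workflow described above, the sketch below quantizes a PixArt-Sigma pipeline's transformer (and optionally its text encoder) to FP8 with optimum-quanto while keeping the rest of the pipeline in FP16. The checkpoint id, the quantize/freeze calls, and the exclude pattern for proj_out are assumptions for illustration rather than the post's exact code; memory savings and latency should be measured on your own hardware.

```python
# Minimal sketch (assumed API usage): FP8-quantize a diffusion transformer
# with optimum-quanto and run it through a Diffusers pipeline.
import torch
from diffusers import PixArtSigmaPipeline
from optimum.quanto import freeze, qfloat8, quantize

pipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",  # assumed checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

# Quantize the diffusion transformer weights to FP8. Excluding the final
# projection (proj_out) reflects the quality-preserving choice noted above.
quantize(pipeline.transformer, weights=qfloat8, exclude="proj_out")
freeze(pipeline.transformer)

# Optionally quantize the text encoder as well for additional memory savings;
# validate quality, since some text-encoder combinations degrade outputs.
quantize(pipeline.text_encoder, weights=qfloat8)
freeze(pipeline.text_encoder)

image = pipeline("a small cactus wearing a straw hat in the desert").images[0]
image.save("pixart_sigma_fp8.png")
```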
Affected Systems
- Date: Not specified
- Change type: capability
- Severity: info