Diffusers quantization backends on Flux models: BitsAndBytes nf4/8-bit, TorchAo, GGUF, Quanto
AI Impact Summary
Diffusers now exposes multiple quantization backends that can be applied to Flux-based diffusion models, including BitsAndBytes nf4/8-bit, TorchAo, GGUF, Quanto, and native FP8. The example uses Flux.1-dev and reports memory falling from 31.447 GB (BF16) to about 12.584 GB with 4-bit and 19.273 GB with 8-bit quantization, with inference times of about 12 s for 4-bit and 27 s for 8-bit on an 80 GB H100. This yields substantial cost and throughput benefits but requires careful validation of image quality, since 4-bit quantization can produce perceptible differences. The post also covers pipeline-level quantization setup and notes that quantization config classes must be imported from both diffusers and transformers, because the pipeline's components originate in different libraries.
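The cross-library import requirement mentioned above can be illustrated with a minimal sketch. This assumes a recent diffusers release with BitsAndBytes support and the bitsandbytes package installed; the aliases and dtype choice are illustrative, not prescribed by the source.

```python
# Sketch: component-level 4-bit (NF4) quantization configs for a Flux pipeline.
# diffusers and transformers each ship their own BitsAndBytesConfig class,
# so both must be imported under distinct aliases.
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

# Config for the diffusion transformer (a diffusers-native model).
dit_quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
)

# Config for the T5 text encoder (a transformers-native model).
text_encoder_quant_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Each config would then be passed via quantization_config= to the
# respective from_pretrained() call (FluxTransformer2DModel for the
# transformer, T5EncoderModel for the text encoder) before assembling
# the FluxPipeline.
```

Because the two config classes share a name but live in different libraries, forgetting the aliases is a common source of confusion when quantizing multi-component pipelines.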
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info