SDXL inference optimizations in Diffusers: FP16, SDPA, and torch.compile
AI Impact Summary
This SDXL optimization briefing demonstrates practical inference-speed and memory-reduction strategies in diffusers for stabilityai/stable-diffusion-xl-base-1.0. The full-precision baseline uses ~28GB of GPU memory and takes ~72s per 4-image batch. Switching to FP16 cuts that to ~21.7GB and 14.8s, and scaled dot-product attention (SDPA) leaves memory unchanged while dropping latency to 11.4s. Adding torch.compile lowers the time further to about 10.2s, with the caveat that the first compiled call is slow while subsequent runs are faster. The briefing also covers CPU offloading options (model CPU offload and sequential CPU offload) that trim memory to ~20.2GB or ~19.9GB respectively, at the expense of latency, with sequential offload pushing it to around 67s.
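A minimal sketch of the FP16 + SDPA + torch.compile stack described above, assuming PyTorch 2.x on a CUDA GPU; the prompt and batch size are illustrative, not from the briefing. On PyTorch 2.0+, diffusers dispatches attention to torch.nn.functional.scaled_dot_product_attention by default, so SDPA needs no extra call:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load in half precision (FP16): ~28GB -> ~21.7GB in the numbers above.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

# SDPA is used automatically on PyTorch 2.0+; memory stays flat while
# attention gets faster (~14.8s -> ~11.4s per 4-image batch above).

# Compile the UNet, the hot loop of denoising. The first call pays a
# one-time compilation cost; subsequent calls run faster (~10.2s above).
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# Illustrative prompt; a 4-image batch matches the briefing's benchmark.
images = pipe(
    prompt="a photo of an astronaut riding a horse on mars",
    num_images_per_prompt=4,
).images
```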
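The offloading trade-off can be sketched the same way; enable_model_cpu_offload and enable_sequential_cpu_offload are the diffusers methods behind the two options, and both expect the pipeline to stay on CPU at load time so accelerate can manage device placement:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Do not call .to("cuda") here; offloading manages device placement itself.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)

# Model-level offload: whole submodels (text encoders, UNet, VAE) move to
# the GPU only while they run. Modest savings (~20.2GB above) for a small
# latency cost.
pipe.enable_model_cpu_offload()

# Sequential offload moves weights layer by layer instead: lower memory
# (~19.9GB above) but latency grows to around 67s. Use one or the other,
# not both.
# pipe.enable_sequential_cpu_offload()

images = pipe(
    prompt="a photo of an astronaut riding a horse on mars",
    num_images_per_prompt=4,
).images
```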
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info