Train your ControlNet with diffusers on Stable Diffusion v2-1-base using SPIGA facial landmarks
AI Impact Summary
This post documents training a ControlNet for Stable Diffusion with the diffusers framework, conditioned on SPIGA facial landmarks. It covers building a 100K-face dataset (Microsoft's FaceSynthetics annotated with SPIGA landmarks and captions), generating the conditioning images, and training against Stable Diffusion v2-1-base at 512x512 resolution with memory-efficient attention over a multi-step run. The post notes the practical GPU requirements (8GB+ VRAM; the example run uses an A100) and the risks of purely synthetic training data, such as overfitting and uncanny, 3D-looking faces, which call for careful evaluation and possibly curating real-data alternatives. Tooling includes wandb for experiment tracking and the Hugging Face Hub for publishing, with the standard diffusers training script providing a reproducible path to the final model.
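To make the dataset step concrete, here is a minimal sketch of loading the captioned FaceSynthetics/SPIGA dataset from the Hub with the `datasets` library. The repo id and the column names are assumptions based on the described workflow, not details confirmed by this summary.

```python
# Minimal sketch: load the captioned FaceSynthetics + SPIGA dataset from the Hub.
# The repo id and column names below are assumptions, not confirmed by this post.
from datasets import load_dataset

dataset = load_dataset("multimodalart/facesyntheticsspigacaptioned", split="train")

sample = dataset[0]
image = sample["image"]              # target face render (assumed column name)
conditioning = sample["spiga_seg"]   # SPIGA landmark drawing (assumed column name)
caption = sample["image_caption"]    # text caption (assumed column name)
print(caption, image.size, conditioning.size)
```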
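Generating a conditioning image from landmarks amounts to rasterizing the detected points onto a blank canvas that the ControlNet sees alongside each training image. The sketch below assumes SPIGA (or another detector) has already produced (x, y) coordinates; the drawing helper is illustrative, not part of SPIGA's API, and a real pipeline would typically draw colored contours per facial region rather than plain dots.

```python
# Sketch: rasterize facial landmarks into a ControlNet conditioning image.
# Assumes SPIGA (or another detector) has already produced (x, y) coordinates;
# this drawing helper is illustrative, not part of the SPIGA API.
from PIL import Image, ImageDraw

def landmarks_to_conditioning(landmarks, size=(512, 512), radius=2):
    """Draw landmark points as white dots on a black 512x512 canvas."""
    canvas = Image.new("RGB", size, "black")
    draw = ImageDraw.Draw(canvas)
    for x, y in landmarks:
        draw.ellipse((x - radius, y - radius, x + radius, y + radius), fill="white")
    return canvas

# Example with dummy coordinates standing in for real SPIGA output.
dummy_landmarks = [(256, 200), (230, 180), (282, 180), (256, 260), (256, 310)]
landmarks_to_conditioning(dummy_landmarks).save("conditioning.png")
```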
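The training itself goes through diffusers' example script `train_controlnet.py`, launched with accelerate. Keeping all code here in Python, the sketch below assembles the CLI invocation with `subprocess`; the flag names follow the diffusers example script, but the concrete values are placeholders echoing this summary (512 resolution, xformers memory-efficient attention, wandb, Hub push), not necessarily the post's exact settings.

```python
# Sketch: launch diffusers' example train_controlnet.py via accelerate.
# Flag names follow the diffusers example script; the concrete values are
# placeholders echoing this summary, not the post's exact settings.
import subprocess

cmd = [
    "accelerate", "launch", "train_controlnet.py",
    "--pretrained_model_name_or_path=stabilityai/stable-diffusion-2-1-base",
    "--dataset_name=multimodalart/facesyntheticsspigacaptioned",  # assumed repo id
    "--image_column=image",
    "--conditioning_image_column=spiga_seg",
    "--caption_column=image_caption",
    "--resolution=512",
    "--learning_rate=1e-5",
    "--train_batch_size=4",
    "--enable_xformers_memory_efficient_attention",  # memory-efficient attention
    "--report_to=wandb",   # experiment tracking
    "--push_to_hub",       # publish checkpoints to the Hugging Face Hub
    "--output_dir=controlnet-spiga-landmarks",
]
subprocess.run(cmd, check=True)
```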
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info
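Once pushed to the Hub, the trained ControlNet can be pulled back down and plugged into a standard diffusers pipeline, which is what makes the run reproducible end to end. A minimal inference sketch, assuming a hypothetical repo id for the trained weights:

```python
# Sketch: run inference with the trained ControlNet. The repo id for the
# trained weights is hypothetical; substitute the one pushed during training.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "your-username/controlnet-spiga-landmarks", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

conditioning = load_image("conditioning.png")  # landmark drawing from the earlier step
image = pipe(
    "a high-quality studio portrait photo",
    image=conditioning,
    num_inference_steps=30,
).images[0]
image.save("generated_face.png")
```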