Fine-tune Stable Diffusion with DDPO via TRL: DDPOTrainer and PPO-based optimization
AI Impact Summary
This summary describes fine-tuning Stable Diffusion with DDPO (Denoising Diffusion Policy Optimization) using the TRL library: the denoising process is framed as a multi-step MDP, and PPO-style updates are applied under the guidance of an aesthetic reward model. The workflow relies on the DDPOTrainer/DDPOConfig classes from the diffusers/TRL stack, hosts outputs on the Hugging Face Hub, and supports optional WandB logging, with the goal of aligning image quality to human preferences. The approach requires substantial compute (A100-class GPUs) and careful management of reward data and training stability to avoid degraded image quality or unsafe outputs.
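As a rough illustration of the workflow, the sketch below wires DDPOConfig and DDPOTrainer together, assuming the API exported by recent trl releases. The prompt, the config values, and the `reward_fn` are illustrative: `reward_fn` is a hypothetical stand-in for the aesthetic reward model, and the commented Hub repo id is likewise made up.

```python
# Minimal DDPO fine-tuning sketch with TRL; assumes trl exposes
# DDPOConfig, DDPOTrainer, and DefaultDDPOStableDiffusionPipeline.
import torch
from trl import DDPOConfig, DDPOTrainer, DefaultDDPOStableDiffusionPipeline


def prompt_fn():
    # DDPOTrainer expects a (prompt, metadata) tuple per sampled trajectory.
    return "a photo of a cute corgi", {}


def reward_fn(images, prompts, metadata):
    # Placeholder reward: swap in an aesthetic scorer (e.g. a CLIP-based
    # preference predictor) that maps each image to a scalar score.
    rewards = torch.randn(len(images))
    return rewards, {}


config = DDPOConfig(
    num_epochs=100,
    sample_num_steps=50,                 # denoising steps per trajectory (MDP horizon)
    sample_batch_size=4,
    train_batch_size=2,
    train_gradient_accumulation_steps=2,
    mixed_precision="fp16",
    log_with="wandb",                    # optional WandB logging
)

pipeline = DefaultDDPOStableDiffusionPipeline(
    "runwayml/stable-diffusion-v1-5"     # base Stable Diffusion checkpoint
)

trainer = DDPOTrainer(config, reward_fn, prompt_fn, pipeline)
trainer.train()

# Optionally push the fine-tuned weights to the Hugging Face Hub
# (hypothetical repo id):
# trainer.push_to_hub("my-user/ddpo-aesthetic-sd")
```

In this setup the trainer samples denoising trajectories from the pipeline, scores the resulting images with the reward function, and performs clipped PPO-style policy updates on the (by default LoRA-adapted) UNet.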
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info