PRX Part 3: 24-hour pixel-space diffusion training for text-to-image models
AI Impact Summary
PRX Part 3 documents a 24-hour speedrun training a text-to-image diffusion model directly in pixel space, combining x-prediction, 32x32 patches, and a 256-dimensional bottleneck to keep token counts manageable at 512px and 1024px resolutions. The run stacks efficiency tricks—LPIPS and DINO perceptual losses, TREAD token routing, REPA-DINO representation alignment, and Muon/FSDP optimization—to squeeze quality and convergence within a $1.5k compute budget. While results on synthetic data are promising and reproducible via open-source code, the model shows residual texture glitches and undertraining artifacts, underscoring that broader data diversity and more compute remain necessary for production-grade generalization.
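The token-count arithmetic behind the patch/bottleneck choice can be sketched directly. A minimal illustration, assuming non-overlapping square patches (the helper name is hypothetical, not from the PRX codebase):

```python
def token_count(resolution: int, patch_size: int = 32) -> int:
    # Number of patch tokens for a square image at the given resolution,
    # assuming non-overlapping patches and resolution divisible by patch_size.
    side = resolution // patch_size
    return side * side

# Each 32x32 RGB patch (32*32*3 = 3072 values) is projected down to a
# 256-dimensional token, so sequence lengths stay short at high resolution:
for res in (512, 1024):
    print(f"{res}px -> {token_count(res)} tokens")
# 512px  -> 256 tokens
# 1024px -> 1024 tokens
```

With 32x32 patches, a 1024px image yields only 1024 tokens, which is what makes attention over raw pixel space tractable within the stated budget.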
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info