Training Design for Text-to-Image Models: REPA speeds convergence for PRX-1.2B in Flux VAE space
AI Impact Summary
The post documents a structured experimental logbook for training efficient text-to-image foundation models from scratch, focusing on training-time optimizations rather than architectural novelty. It highlights representation alignment via REPA, which injects supervision from a frozen vision encoder to guide early learning and reduce compute, alongside a baseline flow-matching setup in Flux VAE latent space using a PRX-1.2B configuration. The work emphasizes reproducibility (a clear baseline, a single configuration across ablations) and announces upcoming public code and a 'speedrun' to demonstrate end-to-end gains. It gives technical teams concrete techniques for accelerating convergence and improving stability under tight compute budgets when scaling text-to-image models.
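As a minimal sketch of the REPA idea described above (not the post's actual implementation; function names, shapes, and the NumPy stand-in for a deep-learning framework are all illustrative assumptions): an intermediate hidden state of the generative backbone is passed through a small trainable projection head and encouraged, via cosine similarity, to match features from a frozen vision encoder.

```python
import numpy as np

def repa_alignment_loss(hidden, target_feats, proj):
    """REPA-style loss: negative mean cosine similarity between
    projected backbone hidden states and frozen-encoder features.

    hidden:       (tokens, d_model) intermediate backbone activations
    target_feats: (tokens, d_enc)   features from a frozen vision encoder
    proj:         (d_model, d_enc)  trainable projection head (hypothetical)
    """
    z = hidden @ proj                                   # map into encoder feature space
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)   # unit-normalize rows
    t = target_feats / np.linalg.norm(target_feats, axis=-1, keepdims=True)
    return -np.mean(np.sum(z * t, axis=-1))             # in [-1, 1]; lower = better aligned

# Toy check: with an identity projection and identical features,
# every row is perfectly aligned, so the loss is exactly -1.0.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
loss = repa_alignment_loss(feats, feats, np.eye(8))
```

In training, this term would be added with a small weight to the primary flow-matching objective (total = flow-matching loss + lambda * alignment loss), so the frozen encoder only guides representations rather than replacing the generative target.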
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info