TRL introduces RLOO Trainer, a memory-efficient online RLHF alternative to PPO
AI Impact Summary
TRL unveils the RLOO Trainer as an online RLHF alternative to PPO that reduces memory footprint and trains faster. RLOO uses 50-70% less VRAM than PPO and runs 2x faster on 1B-scale models (up to 3x faster on larger models), while delivering competitive win rates and outperforming offline methods such as DPO. Adopting it enables larger batch sizes and easier experimentation with online RL methods, but teams should validate reward baselines and update training pipelines to use RLOOTrainer and its associated config (see the sketch below).
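A minimal sketch of what such a pipeline update might look like, based on the TRL RLOO documentation: the model and dataset names are placeholders, the hyperparameter values are illustrative, and keyword arguments such as tokenizer have shifted across TRL releases (newer versions take processing_class instead), so check the docs for the installed version.

```python
# Minimal RLOO pipeline sketch. Model/dataset names and hyperparameters are
# placeholders; keyword names (e.g. tokenizer vs. processing_class) differ
# across TRL releases, so consult the docs for the installed version.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import RLOOConfig, RLOOTrainer

base = "EleutherAI/pythia-1b-deduped"  # placeholder 1B-scale checkpoint
tokenizer = AutoTokenizer.from_pretrained(base, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # generation needs a pad token

policy = AutoModelForCausalLM.from_pretrained(base)      # model being trained
ref_policy = AutoModelForCausalLM.from_pretrained(base)  # frozen KL reference
reward_model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=1                                   # scalar reward head
)

# rloo_k is the number of online completions sampled per prompt; each
# completion's baseline is the mean reward of the other k - 1 samples,
# which is what lets RLOO drop PPO's separate learned value network.
config = RLOOConfig(
    output_dir="rloo-run",
    rloo_k=2,
    per_device_train_batch_size=4,
)

# Placeholder prompt dataset with a "prompt" text column, pre-tokenized
# into the input ids the trainer consumes.
dataset = load_dataset("your-org/your-prompt-dataset", split="train")
dataset = dataset.map(
    lambda ex: tokenizer(ex["prompt"]),
    remove_columns=dataset.column_names,
)

trainer = RLOOTrainer(
    config=config,
    tokenizer=tokenizer,  # processing_class= in newer TRL releases
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=dataset,
)
trainer.train()
```

Because the leave-one-out baseline comes for free from the k sampled completions, memory that PPO spends on the value network can instead go toward the larger batch sizes mentioned above.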
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info