RLOO Trainer in TRL enables online RLHF with lower memory and faster convergence
AI Impact Summary
TRL now exposes RLOOTrainer as an online RLHF alternative to PPO, promising a lower GPU memory footprint and faster convergence. RLOO keeps three model copies in memory (policy, reference policy, reward model) instead of PPO's four, because it needs no separate value network: it treats the entire generated completion as a single action and, instead of a learned critic, baselines each completion's reward against the mean reward of the other completions sampled for the same prompt (the leave-one-out baseline). This reduces OOM risk and increases training throughput at the 1B–6.9B model scale. It lowers the barrier to experimenting with online RL methods and could reduce training costs, though teams relying on PPO defaults or DPO baselines will need to plan for migration.
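A minimal sketch of how the three model copies are wired into the trainer, assuming the RLOOTrainer API as documented when the trainer was introduced (argument names such as config, tokenizer, policy, ref_policy, and reward_model follow those docs and were renamed or refactored in later TRL releases); the checkpoint and prompts are placeholders:

```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import RLOOConfig, RLOOTrainer

base = "EleutherAI/pythia-1b-deduped"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

# Three model copies -- RLOO needs no fourth (value/critic) model.
policy = AutoModelForCausalLM.from_pretrained(base)      # updated online
ref_policy = AutoModelForCausalLM.from_pretrained(base)  # frozen, anchors the KL penalty
reward_model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)

# Tiny illustrative prompt dataset; the trainer consumes tokenized prompts.
prompts = ["TL;DR: the cat sat on the", "Q: What does RLOO stand for? A:"]
train_dataset = Dataset.from_dict(
    {"input_ids": [tokenizer(p)["input_ids"] for p in prompts]}
)

trainer = RLOOTrainer(
    config=RLOOConfig(output_dir="rloo-out", rloo_k=2),  # k completions per prompt
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=train_dataset,
)
trainer.train()
```

The leave-one-out baseline itself is simple enough to show directly. The sketch below is illustrative code for the estimator, not TRL's internal implementation; it assumes one scalar reward per whole completion (the "single action") and k >= 2 completions per prompt:

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out advantages for RLOO.

    rewards: shape (k, num_prompts), one scalar per completion, since the
    entire generation is scored as a single action. Requires k >= 2.
    """
    k = rewards.shape[0]
    # Each completion's baseline is the mean reward of the other k - 1
    # completions for the same prompt -- no learned value network needed.
    baseline = (rewards.sum(dim=0, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline
```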
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info