Hugging Face: TRL introduces RLOO Trainer for memory-efficient online RLHF, replacing PPO | SignalBreak