TRL adopts async RL: disaggregate inference from training with vLLM/SGLang, Ray, and NCCL
AI Impact Summary
The article argues that synchronous RL training is bottlenecked by data generation and proposes disaggregating inference and training onto separate GPU pools, connected by a rollout buffer and asynchronous weight transfer. This improves GPU utilization by overlapping generation with gradient computation and mitigates the straggler problem, at the cost of more GPUs and a more complex transfer stack (e.g., NCCL for weight sync, Ray for orchestration, and the rollout buffer itself). For teams, this implies adopting async trainer patterns (as exemplified by TRL's GRPOTrainer) and integrating vLLM or SGLang for inference, with careful attention to rollout staleness and to LoRA/MoE behavior in distributed setups.
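The core pattern described above can be sketched in a few lines: inference workers push rollouts tagged with the policy version that generated them, the trainer pulls batches and discards samples that are too stale, and each asynchronous weight sync bumps the version. This is a minimal illustrative sketch, not TRL's actual implementation; the `RolloutBuffer` class, its method names, and the `max_staleness` parameter are all hypothetical.

```python
import queue
import threading


class RolloutBuffer:
    """Thread-safe buffer decoupling inference workers from the trainer.

    Hypothetical sketch: rollouts are tagged with the policy version that
    generated them so the trainer can drop samples that are too stale.
    """

    def __init__(self, max_staleness: int = 2):
        self._queue: queue.Queue = queue.Queue()
        self._max_staleness = max_staleness
        self._policy_version = 0
        self._lock = threading.Lock()

    def put(self, rollout) -> None:
        # Inference worker side: tag the rollout with the current version.
        with self._lock:
            version = self._policy_version
        self._queue.put((version, rollout))

    def get_batch(self, batch_size: int) -> list:
        # Trainer side: collect fresh rollouts, silently dropping stale ones.
        batch = []
        while len(batch) < batch_size:
            version, rollout = self._queue.get()
            with self._lock:
                current = self._policy_version
            if current - version <= self._max_staleness:
                batch.append(rollout)
        return batch

    def bump_version(self) -> None:
        # Called after each async weight sync (e.g. an NCCL broadcast
        # from the training pool to the inference pool).
        with self._lock:
            self._policy_version += 1
```

In a real deployment the producer and consumer would run in separate processes on separate GPU pools (e.g. orchestrated by Ray), and `bump_version` would be triggered by the weight-transfer step; the staleness threshold trades off throughput against how off-policy the training data is allowed to become.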
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info