Co-located vLLM in TRL enables shared-GPU training and inference for GRPO
AI Impact Summary
TRL now supports running vLLM inside the same distributed process as the trainer, so GPUs can be shared between training and generation. This removes the HTTP boundary of the earlier server mode, eliminating idle GPU time and the need for separate GPUs dedicated to inference. The optimization is especially impactful for GRPO online-learning workloads, where generation happens continuously, boosting throughput while reducing hardware costs. Adoption requires setting vllm_mode="colocate" in GRPOConfig and tuning vLLM's GPU memory utilization; the integration uses vLLM's external_launcher executor backend and remains compatible with torchrun, tensor/data parallelism, and SPMD execution.
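A minimal sketch of what adoption could look like, assuming TRL's GRPOConfig exposes use_vllm, vllm_mode, vllm_gpu_memory_utilization, and vllm_tensor_parallel_size as described above; the dataset and reward function are illustrative placeholders, not part of the change itself.

```python
# Sketch: enable co-located vLLM generation for GRPO training.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO-colocate",
    use_vllm=True,
    vllm_mode="colocate",              # run vLLM in the trainer process instead of server mode
    vllm_gpu_memory_utilization=0.3,   # leave the rest of each GPU's memory for training
    vllm_tensor_parallel_size=1,       # shard generation across GPUs if > 1
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

Because generation shares the training processes, the script would be launched the usual way (e.g. with torchrun or accelerate) rather than alongside a separate vLLM server.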
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info