vLLM V0 to V1: Correctness Before Corrections in RL Pipeline
AI Impact Summary
The migration from vLLM V0 to V1 required critical fixes to ensure correct logprob computation for online reinforcement learning. Specifically, the team resolved a semantic mismatch by consuming logprobs as 'processed_logprobs' (logprobs computed after sampling parameters are applied) rather than raw model outputs, aligned runtime defaults with the V1 engine, and used an fp32 lm_head for the final projection. These changes eliminated a training-inference mismatch that initially caused the V1 run to deviate significantly from the V0 reference, underscoring the importance of backend parity in RL training pipelines.
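As a minimal sketch of the kind of configuration involved, the snippet below requests processed (post-sampling-parameter) logprobs from the V1 engine and recomputes training-side logprobs through an fp32 lm_head. It assumes a vLLM version that exposes the `logprobs_mode` engine argument and a Hugging Face-style causal LM on the training side; names such as `policy_model` and the model checkpoint are illustrative, not taken from the pipeline described here.

```python
import torch
from vllm import LLM, SamplingParams

# Inference side: ask the V1 engine for logprobs computed *after* sampling
# parameters (temperature, top-p, penalties) are applied, so they describe the
# distribution the tokens were actually drawn from.
# NOTE: logprobs_mode is assumed to be available in the installed vLLM version.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          logprobs_mode="processed_logprobs")

params = SamplingParams(temperature=1.0, top_p=1.0, logprobs=1, max_tokens=64)
outputs = llm.generate(["Write a haiku about gradients."], params)

completion = outputs[0].outputs[0]
inference_logprobs = [
    lp_dict[tok].logprob
    for tok, lp_dict in zip(completion.token_ids, completion.logprobs)
]

# Training side (illustrative): recompute logprobs for the same tokens with the
# final projection (lm_head) carried out in fp32 to avoid precision-induced drift.
def fp32_logprobs(policy_model, input_ids: torch.Tensor) -> torch.Tensor:
    hidden = policy_model.model(input_ids).last_hidden_state           # [B, T, H]
    logits = hidden.float() @ policy_model.lm_head.weight.float().T    # fp32 projection
    logp = torch.log_softmax(logits, dim=-1)
    # Logprob of each next token given its prefix.
    return logp[:, :-1].gather(-1, input_ids[:, 1:, None]).squeeze(-1)
```

The point mirrored from the summary: both sides must use the same post-processing semantics and comparable numerical precision, otherwise the RL update compares logprobs drawn from mismatched distributions.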
Affected Systems
Business Impact
Incorrect logprob computation during RL training with vLLM V1 can lead to suboptimal policy updates and reduced training efficiency, requiring significant debugging and potentially delaying model improvements.
- Date: not specified
- Change type: capability
- Severity: info