GPT-OSS agentic RL training stability issues with Verl framework
AI Impact Summary
This retrospective documents instability in agentic RL training of GPT-OSS with the Verl framework: KL divergence and gradient norms explode for GPT-OSS-20B (and 120B), while Qwen-2.5-32B earns comparatively stronger rewards under the same setup. A root cause is a log-probability mismatch in the Mixture-of-Experts (MoE) setup that breaks the on-policy assumption behind the PPO importance ratio, compounded by a broader training–inference mismatch between the precision-focused training stack (FSDP) and throughput-optimized inference engines (vLLM, SGLang). A workaround that overwrites old_log_prob with the freshly recomputed log_prob helps, but does not fully stabilize learning on GSM8K even when tool use is removed. Until training and inference are co-optimized (routing determinism, MoE behavior, on-policy data alignment, and chat-template compatibility such as Harmony), deploying reliable agentic RL capabilities with GPT-OSS will remain challenging and costly.
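A minimal sketch of why the mismatch destabilizes PPO, assuming a per-token log-prob discrepancy between the rollout engine and the trainer; the tensor shapes and noise scale below are purely illustrative, not measured values:

```python
import torch

def ppo_ratio(new_log_prob: torch.Tensor, old_log_prob: torch.Tensor) -> torch.Tensor:
    """Per-token PPO importance ratio r_t = pi_theta(a_t | s_t) / pi_old(a_t | s_t)."""
    return torch.exp(new_log_prob - old_log_prob)

# Hypothetical illustration: the trainer (FSDP) recomputes log-probs for the same
# sampled tokens, but MoE routing and kernel differences in the inference engine
# (vLLM, SGLang) shift the reported values by a per-token error eps.
torch.manual_seed(0)
trainer_log_prob = torch.randn(8) - 2.0    # log-probs under the training graph
eps = 0.5 * torch.randn(8)                 # assumed train/infer discrepancy
engine_log_prob = trainer_log_prob + eps   # what the rollout engine reported

exact = ppo_ratio(trainer_log_prob, trainer_log_prob)   # truly on-policy
biased = ppo_ratio(trainer_log_prob, engine_log_prob)   # mismatch looks off-policy

print(exact)   # all ones: the on-policy assumption PPO relies on
print(biased)  # ratios far from 1.0 before any parameter update has happened
```

Because the biased ratios deviate from 1.0 at step zero, PPO's clipping and the KL estimate both react to noise that has nothing to do with policy drift, which is consistent with the exploding KL and gradient norms described above.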
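A minimal sketch of the overwrite workaround mentioned above; the helper name and batch layout are hypothetical illustrations, not Verl's actual API:

```python
import torch

def apply_logprob_overwrite(batch: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Discard the rollout engine's log-probs and reuse the trainer-side
    recomputation, so the first PPO update sees a ratio of exactly 1."""
    batch = dict(batch)
    batch["old_log_prob"] = batch["log_prob"].detach()  # trainer recomputation wins
    return batch
```

With the ratio pinned to 1 at the start of each update, the clipped objective reduces to a plain policy-gradient step, removing the spurious off-policy correction; as noted above, however, this still does not fully stabilize GSM8K training even with tool use removed.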
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info