GPT-OSS agentic RL training retrospective — fixes for PPO on-policy integrity and MoE routing
AI Impact Summary
GPT-OSS agentic RL training aims to optimize multi-step decision making, but early runs show exploding gradient norms, rising entropy, and non-improving rewards, all signs of unstable optimization. The report pins the root causes on a MoE routing mismatch between the dual forward passes, a log-probability mismatch that breaks PPO's importance ratio, and a broader training–inference misalignment between precision-focused training (FSDP) and throughput-oriented inference (vLLM/SGLang). A concrete fix enforces a ratio of 1 by substituting log-probs for on-policy data, yet instability persists even on simplified tasks such as GSM8K, implying deeper architectural or framework-level misalignment. The path forward calls for aligning routing determinism, Harmony template compatibility, and verl framework behavior before GPT-OSS can reliably serve as a backbone for agentic workflows that depend on tool use and multi-step coordination.
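Below is a minimal sketch of the ratio=1 fix described above, assuming a torch-based PPO policy loss. The function name, the `assume_on_policy` flag, and the mismatch diagnostic are illustrative only and are not the verl API; the idea is that for a strictly on-policy batch the rollout log-probs can be replaced by a detached copy of the training-engine log-probs, so the importance ratio is exactly 1 and the training/inference log-prob gap cannot distort the clipped surrogate.

```python
import torch

def ppo_policy_loss(new_log_probs, rollout_log_probs, advantages,
                    clip_eps=0.2, assume_on_policy=False):
    """Clipped PPO surrogate loss (sketch, not the verl implementation).

    new_log_probs:     per-token log-probs recomputed by the training
                       engine (e.g. the FSDP forward pass), with grad.
    rollout_log_probs: per-token log-probs recorded by the inference
                       engine (e.g. vLLM/SGLang) during rollout.
    """
    if assume_on_policy:
        # Diagnostic: how far apart are the two engines' log-probs?
        # A large gap here is the mismatch discussed in the report.
        gap = (new_log_probs.detach() - rollout_log_probs).abs().max()
        print(f"max |train - rollout| log-prob gap: {gap.item():.4f}")

        # Substitute a detached copy of the training-engine log-probs
        # so that the ratio is exactly 1 numerically while gradients
        # still flow through new_log_probs.
        rollout_log_probs = new_log_probs.detach()

    ratio = torch.exp(new_log_probs - rollout_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

With the substitution in place the surrogate degenerates to a plain policy-gradient step for on-policy batches, which is why persistent instability on GSM8K points past the PPO ratio toward routing determinism and framework-level behavior.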
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info