GPT-OSS agentic RL training retrospective — fixes for PPO on-policy integrity and MoE routing
AI Impact Summary
GPT-OSS agentic RL training aims to optimize multi-step decision making, but early runs show exploding gradient norms, rising entropy, and non-improving rewards, all signs of unstable optimization. The report pins the root causes on a MoE routing mismatch between the dual forward passes, a log-probability mismatch that breaks PPO's importance ratio, and a broader training–inference misalignment between precision-focused training (FSDP) and throughput-oriented inference (vLLM/SGLang). A concrete fix enforces a ratio of 1 by substituting log-probs for on-policy data, yet instability persists even on simplified tasks such as GSM8K, implying deeper architectural or framework-level misalignment. The path forward calls for aligning routing determinism, Harmony template compatibility, and verl framework behavior before GPT-OSS can reliably serve as a backbone for agentic workflows that depend on tool use and multi-step coordination.
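Below is a minimal sketch of the ratio=1 fix described above, assuming a torch-based PPO policy loss. The function name, the `assume_on_policy` flag, and the mismatch diagnostic are illustrative only and are not the verl API; the idea is that for a strictly on-policy batch the rollout log-probs can be replaced by a detached copy of the training-engine log-probs, so the importance ratio is exactly 1 and the training/inference log-prob gap cannot distort the clipped surrogate.

```python
import torch

def ppo_policy_loss(new_log_probs, rollout_log_probs, advantages,
                    clip_eps=0.2, assume_on_policy=False):
    """Clipped PPO surrogate loss (sketch, not the verl implementation).

    new_log_probs:     per-token log-probs recomputed by the training
                       engine (e.g. the FSDP forward pass), with grad.
    rollout_log_probs: per-token log-probs recorded by the inference
                       engine (e.g. vLLM/SGLang) during rollout.
    """
    if assume_on_policy:
        # Diagnostic: how far apart are the two engines' log-probs?
        # A large gap here is the mismatch discussed in the report.
        gap = (new_log_probs.detach() - rollout_log_probs).abs().max()
        print(f"max |train - rollout| log-prob gap: {gap.item():.4f}")

        # Substitute a detached copy of the training-engine log-probs
        # so that the ratio is exactly 1 numerically while gradients
        # still flow through new_log_probs.
        rollout_log_probs = new_log_probs.detach()

    ratio = torch.exp(new_log_probs - rollout_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

With the substitution in place the surrogate degenerates to a plain policy-gradient step for on-policy batches, which is why persistent instability on GSM8K points past the PPO ratio toward routing determinism and framework-level behavior.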
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info