GPT-OSS agentic RL training stability issues with Verl framework
AI Impact Summary
This retrospective documents instability in agentic RL training of GPT-OSS with the Verl framework: KL divergence and gradient norms explode for GPT-OSS-20B (and 120B), while Qwen-2.5-32B earns comparatively stronger rewards under the same setup. A root cause is a log-probability mismatch in the Mixture-of-Experts (MoE) setup that breaks the on-policy assumption behind the PPO importance ratio, compounded by a broader training–inference mismatch between the precision-focused training stack (FSDP) and throughput-optimized inference engines (vLLM, SGLang). A workaround that overwrites old_log_prob with the freshly recomputed log_prob helps, but does not fully stabilize learning on GSM8K even when tool use is removed. Until training and inference are co-optimized (routing determinism, MoE behavior, on-policy data alignment, and chat-template compatibility such as Harmony), deploying reliable agentic RL capabilities with GPT-OSS will remain challenging and costly.
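A minimal sketch of why the mismatch destabilizes PPO, assuming a per-token log-prob discrepancy between the rollout engine and the trainer; the tensor shapes and noise scale below are purely illustrative, not measured values:

```python
import torch

def ppo_ratio(new_log_prob: torch.Tensor, old_log_prob: torch.Tensor) -> torch.Tensor:
    """Per-token PPO importance ratio r_t = pi_theta(a_t | s_t) / pi_old(a_t | s_t)."""
    return torch.exp(new_log_prob - old_log_prob)

# Hypothetical illustration: the trainer (FSDP) recomputes log-probs for the same
# sampled tokens, but MoE routing and kernel differences in the inference engine
# (vLLM, SGLang) shift the reported values by a per-token error eps.
torch.manual_seed(0)
trainer_log_prob = torch.randn(8) - 2.0    # log-probs under the training graph
eps = 0.5 * torch.randn(8)                 # assumed train/infer discrepancy
engine_log_prob = trainer_log_prob + eps   # what the rollout engine reported

exact = ppo_ratio(trainer_log_prob, trainer_log_prob)   # truly on-policy
biased = ppo_ratio(trainer_log_prob, engine_log_prob)   # mismatch looks off-policy

print(exact)   # all ones: the on-policy assumption PPO relies on
print(biased)  # ratios far from 1.0 before any parameter update has happened
```

Because the biased ratios deviate from 1.0 at step zero, PPO's clipping and the KL estimate both react to noise that has nothing to do with policy drift, which is consistent with the exploding KL and gradient norms described above.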
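A minimal sketch of the overwrite workaround mentioned above; the helper name and batch layout are hypothetical illustrations, not Verl's actual API:

```python
import torch

def apply_logprob_overwrite(batch: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Discard the rollout engine's log-probs and reuse the trainer-side
    recomputation, so the first PPO update sees a ratio of exactly 1."""
    batch = dict(batch)
    batch["old_log_prob"] = batch["log_prob"].detach()  # trainer recomputation wins
    return batch
```

With the ratio pinned to 1 at the start of each update, the clipped objective reduces to a plain policy-gradient step, removing the spurious off-policy correction; as noted above, however, this still does not fully stabilize GSM8K training even with tool use removed.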
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info