Hugging Face: GPT-OSS agentic RL training retrospective — fixes for PPO on-policy integrity and MoE routing | SignalBreak