RLHF with PPO: Reproducing OpenAI’s 2019 codebase (lm-human-preferences)
AI Impact Summary
The post documents a reproduction of OpenAI's 2019 RLHF-with-PPO implementation, detailing how the reward model and the policy's value head both operate on the concatenated query and response to produce per-token scores. It highlights critical engineering choices for GPT-2 style inputs, such as padding, position indexing, and masking, that are essential for faithfully replicating the original learning curves and reward signals. It also exposes significant hardware and framework constraints (TensorFlow 1.x, multi-GPU setups, AWS p3dn.24xlarge instances) that limit scalability and push teams toward modern PyTorch-based stacks such as HuggingFace TRL. Finally, it notes dataset issues and the need to port OpenAI's original data to HuggingFace datasets, underscoring the validation risks involved in moving off the original data and codebase.
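To make the padding and indexing choices concrete, here is a minimal PyTorch sketch of the pattern the post describes: left-pad the query, concatenate it with the response, mask out the pad tokens, derive position ids that skip the padding, and read the score at the last non-pad token. The function names (`build_inputs`, `last_token_score`) and shapes are illustrative assumptions, not the original codebase's API.

```python
import torch
import torch.nn.functional as F

def build_inputs(query, response, pad_token_id, max_query_len):
    """Illustrative helper (not from lm-human-preferences): left-pad the
    query, concatenate with the response, and derive mask/position ids."""
    # Left-pad the query so query + response form one contiguous sequence.
    pad_len = max_query_len - query.shape[-1]
    padded_query = F.pad(query, (pad_len, 0), value=pad_token_id)
    input_ids = torch.cat([padded_query, response], dim=-1)

    # Attention mask: 0 on pad tokens so attention ignores them.
    # (Assumes pad_token_id never appears inside real text.)
    attention_mask = (input_ids != pad_token_id).long()

    # Position ids must skip the padding: cumulative sum of the mask minus
    # one, clamped at zero, so the first real token sits at position 0
    # regardless of how much padding precedes it.
    position_ids = (attention_mask.cumsum(-1) - 1).clamp(min=0)
    return input_ids, attention_mask, position_ids

def last_token_score(per_token_scores, attention_mask):
    """Pick the scalar score at the final non-pad position of
    query + response, where the sequence-level reward is read off."""
    seq_len = attention_mask.shape[-1]
    # Index of the last non-pad token (robust to either padding side).
    last_idx = seq_len - 1 - attention_mask.flip(-1).argmax(-1)
    return per_token_scores.gather(1, last_idx.unsqueeze(-1)).squeeze(-1)

# Toy usage with pad_token_id = 0 and a batch of one.
query = torch.tensor([[5, 6, 7]])
response = torch.tensor([[8, 9]])
ids, mask, pos = build_inputs(query, response, pad_token_id=0, max_query_len=5)
# ids  -> [[0, 0, 5, 6, 7, 8, 9]]
# mask -> [[0, 0, 1, 1, 1, 1, 1]]
# pos  -> [[0, 0, 0, 1, 2, 3, 4]]
scores = torch.randn(1, ids.shape[-1])   # stand-in for a reward-head output
reward = last_token_score(scores, mask)  # one scalar reward per sequence
```

Getting the position ids wrong silently shifts every absolute positional embedding in a GPT-2 style model, which is one reason the post treats these details as essential for matching the original learning curves.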
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info