Reproducing OpenAI's RLHF with PPO — TensorFlow 1.x Implementation Details
AI Impact Summary
This document details a reproduction of OpenAI's 2019 Reinforcement Learning from Human Feedback (RLHF) codebase, focusing on its implementation of Proximal Policy Optimization (PPO). The key technical insights concern how the model handles input sequences, specifically padding and tokenization, and how it masks padding tokens when computing logits. The reproduction requires a TensorFlow 1.x environment and highlights the challenges of running the original code on modern hardware, including GPU limitations and incompatibility with newer CUDA versions, which necessitated the use of a specific AWS instance (p3dn.24xlarge).
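To make the padding-mask step concrete, below is a minimal TensorFlow 1.x sketch of one common approach: pushing the padding token's logit to a large negative value so it receives near-zero probability after softmax and is never sampled. The function name `mask_pad_logits`, the tensor shapes, and the `pad_token_id` value are illustrative assumptions, not the original codebase's API.

```python
import tensorflow as tf  # TensorFlow 1.x

def mask_pad_logits(logits, pad_token_id):
    """Push the padding token's logit toward -inf (hypothetical helper).

    Assumes `logits` has shape [batch, seq_len, vocab_size]. After softmax,
    the padding token then gets ~zero probability, so the policy never
    emits padding during rollouts.
    """
    vocab_size = logits.shape[-1].value  # static vocab size in TF 1.x
    # One-hot vector selecting the pad token along the vocabulary axis;
    # it broadcasts against the [batch, seq_len, vocab_size] logits.
    pad_mask = tf.one_hot(pad_token_id, vocab_size, dtype=logits.dtype)
    # Subtract a large constant from the pad token's logit only.
    return logits - pad_mask * 1e10

# Usage inside a TF 1.x graph (shapes and pad id are assumptions):
# logits = model(tokens)                          # [batch, seq, vocab]
# masked = mask_pad_logits(logits, pad_token_id=50257)
```

Subtracting a large constant, rather than zeroing probabilities after softmax, keeps the masking inside the logit space, so downstream log-prob and entropy computations remain numerically consistent.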
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info