RLHF with PPO: Reproducing OpenAI lm-human-preferences in TensorFlow 1.x
AI Impact Summary
The post documents reproducing OpenAI's 2019 RLHF/PPO workflow, detailing how the reward model and value head take the concatenated query+response as input, how padding tokens are handled, and how rewards are assigned per token, with explicit notes on the hardware constraints of the TF1.x code. This level of specificity matters for engineers attempting a reproduction, because exact tokenization, padding, and batching behavior determine the learning curves and whether results can be compared against OpenAI's. The material also highlights that the reproduction relies on legacy TensorFlow 1.x code and a demanding multi-GPU setup (e.g., 8x V100 32GB on an AWS p3dn.24xlarge) and has limited practicality for production; for scalability and future work, moving to modern pipelines such as HuggingFace RLHF with PEFT is advised.
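To make the query+response input, padding handling, and per-token reward points concrete, here is a minimal Python/NumPy sketch of how such rewards are typically assembled in an lm-human-preferences-style PPO loop. All names (per_token_rewards, kl_coef, pad_token_id, reward_model_score) are illustrative assumptions, not identifiers from the original codebase.

```python
import numpy as np

def per_token_rewards(response, logprobs, ref_logprobs,
                      reward_model_score, kl_coef=0.15, pad_token_id=0):
    """Sketch: build the per-token reward for one sampled response.

    response: 1-D int array of response token ids (may include padding).
    logprobs, ref_logprobs: per-token log-probs of `response` under the
        policy and the frozen reference model, respectively.
    reward_model_score: scalar score from the reward model, which received
        the full query+response sequence as its input.
    """
    # Per-token KL penalty between the policy and the reference model.
    rewards = -kl_coef * (logprobs - ref_logprobs)

    # The scalar reward-model score is credited only at the last
    # non-padding response token, so padding placement changes where
    # the learning signal lands.
    non_pad = np.nonzero(response != pad_token_id)[0]
    last = non_pad[-1] if len(non_pad) > 0 else len(response) - 1
    rewards[last] += reward_model_score
    return rewards
```

Under these assumptions, a mismatch in tokenization or padding shifts which position receives the scalar score, which is one reason exact preprocessing details affect the learning curves.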
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info