RLHF with PPO: Reproducing OpenAI’s 2019 codebase (lm-human-preferences)
AI Impact Summary
The post documents a reproduction of OpenAI's 2019 RLHF-with-PPO implementation, detailing how the reward model and the policy's value head both operate on the concatenated query and response to produce per-token scores. It highlights critical engineering choices for GPT-2 style inputs, such as padding, position indexing, and masking, that are essential for faithfully replicating the original learning curves and reward signals. It also exposes significant hardware and framework constraints (TensorFlow 1.x, multi-GPU setups, AWS p3dn.24xlarge instances) that limit scalability and push teams toward modern PyTorch-based stacks such as HuggingFace TRL. Finally, it notes dataset issues and the need to port OpenAI's original data to HuggingFace datasets, underscoring the validation risks involved in moving off the original data and codebase.
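To make the padding and indexing choices concrete, here is a minimal PyTorch sketch of the pattern the post describes: left-pad the query, concatenate it with the response, mask out the pad tokens, derive position ids that skip the padding, and read the score at the last non-pad token. The function names (`build_inputs`, `last_token_score`) and shapes are illustrative assumptions, not the original codebase's API.

```python
import torch
import torch.nn.functional as F

def build_inputs(query, response, pad_token_id, max_query_len):
    """Illustrative helper (not from lm-human-preferences): left-pad the
    query, concatenate with the response, and derive mask/position ids."""
    # Left-pad the query so query + response form one contiguous sequence.
    pad_len = max_query_len - query.shape[-1]
    padded_query = F.pad(query, (pad_len, 0), value=pad_token_id)
    input_ids = torch.cat([padded_query, response], dim=-1)

    # Attention mask: 0 on pad tokens so attention ignores them.
    # (Assumes pad_token_id never appears inside real text.)
    attention_mask = (input_ids != pad_token_id).long()

    # Position ids must skip the padding: cumulative sum of the mask minus
    # one, clamped at zero, so the first real token sits at position 0
    # regardless of how much padding precedes it.
    position_ids = (attention_mask.cumsum(-1) - 1).clamp(min=0)
    return input_ids, attention_mask, position_ids

def last_token_score(per_token_scores, attention_mask):
    """Pick the scalar score at the final non-pad position of
    query + response, where the sequence-level reward is read off."""
    seq_len = attention_mask.shape[-1]
    # Index of the last non-pad token (robust to either padding side).
    last_idx = seq_len - 1 - attention_mask.flip(-1).argmax(-1)
    return per_token_scores.gather(1, last_idx.unsqueeze(-1)).squeeze(-1)

# Toy usage with pad_token_id = 0 and a batch of one.
query = torch.tensor([[5, 6, 7]])
response = torch.tensor([[8, 9]])
ids, mask, pos = build_inputs(query, response, pad_token_id=0, max_query_len=5)
# ids  -> [[0, 0, 5, 6, 7, 8, 9]]
# mask -> [[0, 0, 1, 1, 1, 1, 1]]
# pos  -> [[0, 0, 0, 1, 2, 3, 4]]
scores = torch.randn(1, ids.shape[-1])   # stand-in for a reward-head output
reward = last_token_score(scores, mask)  # one scalar reward per sequence
```

Getting the position ids wrong silently shifts every absolute positional embedding in a GPT-2 style model, which is one reason the post treats these details as essential for matching the original learning curves.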
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info