Reproducing OpenAI's RLHF with PPO — TensorFlow 1.x Implementation Details
AI Impact Summary
This document details a reproduction of OpenAI's 2019 Reinforcement Learning from Human Feedback (RLHF) codebase, focusing on its implementation of Proximal Policy Optimization (PPO). The key technical insights concern how the model handles input sequences, specifically padding and tokenization, and how it masks padding tokens when computing logits. The reproduction requires a TensorFlow 1.x environment and highlights the challenges of running the original code on modern hardware, including GPU limitations and incompatibility with newer CUDA versions, which necessitated the use of a specific AWS instance (p3dn.24xlarge).
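To make the padding-mask step concrete, below is a minimal TensorFlow 1.x sketch of one common approach: pushing the padding token's logit to a large negative value so it receives near-zero probability after softmax and is never sampled. The function name `mask_pad_logits`, the tensor shapes, and the `pad_token_id` value are illustrative assumptions, not the original codebase's API.

```python
import tensorflow as tf  # TensorFlow 1.x

def mask_pad_logits(logits, pad_token_id):
    """Push the padding token's logit toward -inf (hypothetical helper).

    Assumes `logits` has shape [batch, seq_len, vocab_size]. After softmax,
    the padding token then gets ~zero probability, so the policy never
    emits padding during rollouts.
    """
    vocab_size = logits.shape[-1].value  # static vocab size in TF 1.x
    # One-hot vector selecting the pad token along the vocabulary axis;
    # it broadcasts against the [batch, seq_len, vocab_size] logits.
    pad_mask = tf.one_hot(pad_token_id, vocab_size, dtype=logits.dtype)
    # Subtract a large constant from the pad token's logit only.
    return logits - pad_mask * 1e10

# Usage inside a TF 1.x graph (shapes and pad id are assumptions):
# logits = model(tokens)                          # [batch, seq, vocab]
# masked = mask_pad_logits(logits, pad_token_id=50257)
```

Subtracting a large constant, rather than zeroing probabilities after softmax, keeps the masking inside the logit space, so downstream log-prob and entropy computations remain numerically consistent.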
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info