PPO-based RL agent learns Montezuma’s Revenge from a single demonstration
AI Impact Summary
By starting PPO training episodes from carefully chosen states in a human demonstration, the agent bypasses extensive random exploration and achieves a high score on Montezuma's Revenge. This demonstrates strong sample efficiency in a sparse-reward environment, indicating that demonstration-primed policies can reach competitive performance with far less data. For teams, it suggests investing in demonstration selection and evaluation pipelines, and in applying the approach to other hard RL benchmarks or real-world control tasks where demonstrations are available.
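The mechanism behind "carefully chosen states" is a starting-state curriculum: episodes begin from states near the end of the demonstration, and the start point is moved earlier once the agent reliably collects reward from there. The sketch below is a minimal, self-contained illustration of that curriculum logic only; the toy `CorridorEnv`, the fixed stand-in policy, and names like `demo_curriculum_training` are hypothetical, and a real implementation would restore emulator snapshots and run PPO updates on the collected rollouts at each curriculum step.

```python
from dataclasses import dataclass

# Hypothetical demonstration: a list of environment states recorded from a
# human playthrough. In the real setting these would be emulator snapshots;
# here each "state" is just a position along a 1-D corridor.
DEMO_STATES = list(range(0, 50))   # state 49 is one step from the goal
GOAL = 50


@dataclass
class CorridorEnv:
    """Toy sparse-reward environment: reward 1 only on reaching GOAL."""
    state: int = 0

    def reset_to(self, state: int) -> int:
        self.state = state
        return self.state

    def step(self, action: int):
        self.state = max(self.state + (1 if action == 1 else -1), 0)
        done = self.state >= GOAL
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env: CorridorEnv, start_state: int, policy) -> float:
    """Roll out one episode starting from a chosen demonstration state."""
    s = env.reset_to(start_state)
    total = 0.0
    for _ in range(200):
        s, r, done = env.step(policy(s))
        total += r
        if done:
            break
    return total


def demo_curriculum_training(success_threshold: float = 0.8, batch: int = 20):
    """Start episodes near the end of the demo; once the agent reliably
    reaches the goal from there, move the start point further back."""
    env = CorridorEnv()
    # Stand-in for a learned PPO policy: a fixed "move right" policy,
    # used purely to exercise the curriculum logic.
    policy = lambda s: 1
    start_idx = len(DEMO_STATES) - 1   # begin at the last demo state
    while start_idx >= 0:
        returns = [run_episode(env, DEMO_STATES[start_idx], policy)
                   for _ in range(batch)]
        success_rate = sum(returns) / batch
        # In the real method, PPO updates on the collected rollouts go here.
        if success_rate >= success_threshold:
            start_idx -= 1             # curriculum step: start earlier in the demo
    print("Curriculum reached the start of the demonstration.")


if __name__ == "__main__":
    demo_curriculum_training()
```

The design point this illustrates is that the sparse reward is always only a few steps away from the current start state, so standard PPO-style rollouts see reward frequently even though the full task would require thousands of exploratory steps from the true initial state.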
Affected Systems
- Date: Not specified
- Change type: Capability
- Severity: Medium