Learning from human preferences — new preference-inference capability (DeepMind collaboration)
AI Impact Summary
A new capability enables systems to infer human-aligned objectives from pairwise preference data, removing the need to handcraft goal functions. By repeatedly asking a human which of two behavior segments is better, the model learns a surrogate reward objective that guides policy updates, improving safety alignment and reducing the risk of objective mis-specification. Technical teams should plan for preference data pipelines, labeling throughput, and validation to prevent biased preference data from distorting learned behavior.
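The mechanism above can be sketched minimally. In the common formulation, pairwise preferences are fit with a Bradley-Terry model: the probability that segment A is preferred over segment B is a sigmoid of the difference in their predicted returns. The sketch below is a hypothetical, simplified setup (a linear reward model over hand-made features, simulated rather than human labels), not the exact architecture or training procedure of the described capability.

```python
import numpy as np

# Sketch of preference-based reward learning under the Bradley-Terry model:
# P(A preferred over B) = sigmoid(R(A) - R(B)), where R is the summed
# predicted reward over a trajectory segment. All names here are illustrative.

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def segment_return(w, segment):
    """Total predicted reward of a segment (rows = per-step feature vectors)."""
    return float(np.sum(segment @ w))

def train_reward_model(pairs, labels, dim, lr=0.1, epochs=200):
    """Fit a linear reward model w by gradient descent on the
    Bradley-Terry log-loss over labeled preference pairs."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for (seg_a, seg_b), pref_a in zip(pairs, labels):
            p_a = sigmoid(segment_return(w, seg_a) - segment_return(w, seg_b))
            # Gradient of -log-likelihood w.r.t. w:
            grad = (p_a - pref_a) * (seg_a.sum(axis=0) - seg_b.sum(axis=0))
            w -= lr * grad
    return w

# Hypothetical ground-truth reward, used only to simulate human labels.
true_w = np.array([1.0, -2.0, 0.5])

pairs, labels = [], []
for _ in range(200):
    a = rng.normal(size=(5, 3))  # 5-step segment, 3 features per step
    b = rng.normal(size=(5, 3))
    pairs.append((a, b))
    labels.append(1.0 if segment_return(true_w, a) > segment_return(true_w, b) else 0.0)

w_hat = train_reward_model(pairs, labels, dim=3)

# The learned reward should rank segment pairs the way the labeler did.
correct = sum(
    (segment_return(w_hat, a) > segment_return(w_hat, b)) == bool(pref)
    for (a, b), pref in zip(pairs, labels)
)
accuracy = correct / len(pairs)
print(f"training-pair ranking accuracy: {accuracy:.2f}")
```

In a full pipeline, the learned surrogate reward `w_hat` would then be used as the reward signal for a standard policy-optimization loop, with fresh preference queries collected as the policy's behavior distribution shifts.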
Business Impact
Organizations can reduce risk from mis-specified goals by training models to optimize behaviors aligned with human preferences, but must invest in data labeling, quality control, and bias mitigation in preference data.
Risk domains
Source text
- Date: not specified
- Change type: capability
- Severity: medium