Adversarial attacks on neural network policies: risk to policy enforcement
AI Impact Summary
Neural network policy modules are susceptible to adversarial attacks: carefully crafted inputs can sway or bypass policy decisions. This creates a risk of unsafe outputs or policy violations in systems that rely on learned policies for content moderation, agent behavior, or compliance checks. To protect operations, focus on robust policy hardening, monitoring for anomalous policy behavior, and coordinated adversarial testing across model versions.
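As a concrete illustration of how such crafted inputs are produced, below is a minimal sketch of a targeted FGSM-style perturbation against a policy classifier. The network architecture, input dimensions, the `fgsm_perturb` helper, and the epsilon value are all illustrative assumptions, not details from this advisory; real hardening and red-team efforts typically use stronger iterative attacks (e.g., PGD) against the actual policy models.

```python
# Minimal sketch of a targeted FGSM-style attack on a hypothetical
# allow/deny policy network. All names and shapes here are assumptions
# for illustration only.
import torch
import torch.nn as nn

# Hypothetical policy network: maps a 64-dim feature vector to
# deny/allow logits (index 0 = deny, index 1 = allow).
policy_net = nn.Sequential(
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)

def fgsm_perturb(x, target_class, epsilon=0.25):
    """Craft a single-step perturbation nudging the policy toward target_class."""
    x = x.clone().detach().requires_grad_(True)
    logits = policy_net(x)
    # Loss toward the attacker's desired class...
    loss = nn.functional.cross_entropy(logits, target_class)
    loss.backward()
    # ...so we step *against* the gradient to minimize it.
    return (x - epsilon * x.grad.sign()).detach()

# An input the policy would otherwise evaluate normally.
x = torch.randn(1, 64)
target = torch.tensor([1])  # attacker wants "allow"
x_adv = fgsm_perturb(x, target)

# On a randomly initialized net a single FGSM step may not always flip
# the decision; in practice attackers iterate (PGD) and tune epsilon.
print("original decision:", policy_net(x).argmax(dim=1).item())
print("perturbed decision:", policy_net(x_adv).argmax(dim=1).item())
```

The same loop, run by defenders rather than attackers, doubles as a regression test: re-running it against each new model version is one way to implement the coordinated testing recommended above.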
Business Impact
Exploitation of policy vulnerabilities could lead to unsafe or non-compliant outputs, exposing the business to regulatory risk, user safety concerns, and reputational damage until mitigations are deployed.
Risk domains
- Date: not specified
- Change type: capability
- Severity: medium