MediumCapability

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

AI Impact Summary

Large language models are vulnerable to prompt injection attacks, where malicious prompts can override intended behavior and lead to unintended outputs. This vulnerability stems from the models' reliance on the initial instructions provided during training, which can be manipulated. Addressing this requires robust techniques for training LLMs to prioritize and defend against these injected instructions, ensuring consistent and safe operation.

Affected Systems

LLMs

Business Impact

Unprotected LLMs can be exploited to generate harmful content, violate company policies, or compromise sensitive data, necessitating investment in instruction-tuning and safety mechanisms.

Date: Date not specified
Change type: capability
Severity: medium

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

More from OpenAI

Get alerts for OpenAI