CriticalCapability

OpenAI Research Demo: "Golden Gate Claude" Demonstrates Feature Manipulation

Action Required

This demo provides insights into OpenAI's research on interpretability and feature manipulation within large language models, potentially informing future development and safety measures.

AI Impact Summary

OpenAI is releasing a research demo, "Golden Gate Claude", to showcase its work on interpretability within Claude 3 Sonnet. This model demonstrates the ability to manipulate internal features within the model – in this case, the activation of the "Golden Gate Bridge" concept – leading to unexpected and playful responses. This is a demonstration of a new research capability, not a production model, and highlights the potential for controlling and understanding AI behavior.

Affected Systems

Claude 3 Sonnet

Date: 23 May 2024
Change type: capability
Severity: critical

OpenAI Research Demo: "Golden Gate Claude" Demonstrates Feature Manipulation

More from Anthropic

Get alerts for Anthropic