Anthropic expands model safety bug bounty program to test new mitigations
Action Required
Failure to address universal jailbreaks could lead to misuse of Anthropic's AI models with potentially severe consequences in high-risk domains.
AI Impact Summary
Anthropic is expanding its model safety bug bounty program to proactively identify and mitigate universal jailbreak attacks, particularly in high-risk domains such as CBRN and cybersecurity. The initiative focuses on testing a new, unreleased safety mitigation system and offers rewards of up to $15,000 for successful submissions. Addressing universal jailbreaks is critical to preventing misuse of AI models and ensuring responsible AI development, in line with broader industry commitments.
Affected Systems
- Date: not specified
- Change type: capability
- Severity: high