Introducing the Red-Teaming Resistance Leaderboard: benchmarking LLM robustness
AI Impact Summary
The Red-Teaming Resistance Leaderboard introduces a new benchmark for evaluating the robustness of large language models against adversarial prompts. The leaderboard measures a model's ability to resist producing harmful outputs, grouped into violation categories such as promoting violence, generating inappropriate content, and facilitating illegal activities. The benchmark draws on datasets like AdvBench and AART, alongside human evaluation, to assess model vulnerabilities across these categories; early results indicate that closed-source models like GPT-4 and Claude-2 currently demonstrate superior resilience.
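The per-category scoring described above can be sketched as follows. This is a minimal, hypothetical illustration of computing a "resistance" score as the fraction of adversarial prompts a model refuses in each violation category; the `Prompt` type and `resistance_by_category` function are illustrative names, not part of the leaderboard's actual codebase.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Prompt:
    """One adversarial prompt and whether the model refused it."""
    category: str   # e.g. "violence", "illegal_activity"
    refused: bool   # True if the model declined the harmful request


def resistance_by_category(results):
    """Return the refusal rate per violation category."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for p in results:
        totals[p.category] += 1
        refusals[p.category] += p.refused
    return {c: refusals[c] / totals[c] for c in totals}


results = [
    Prompt("violence", True),
    Prompt("violence", False),
    Prompt("illegal_activity", True),
]
print(resistance_by_category(results))
# → {'violence': 0.5, 'illegal_activity': 1.0}
```

A higher score in a category means the model more consistently refused that category's adversarial prompts, which is the sense in which the leaderboard ranks closed-source models as more resilient.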
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info