Red-Teaming Resistance Leaderboard benchmarks frontier LLM robustness (GPT-4, Claude-2, Vicuna-13B)
AI Impact Summary
Haize Labs' Red-Teaming Resistance Benchmark introduces a formal, cross-dataset score for frontier LLM robustness by challenging models with high-quality adversarial prompts drawn from datasets such as AdvBench, AART, Beavertails, DNA, RedEval-HarmfulQA, and RedEval-DangerousQA. Responses are labeled Safe or Unsafe by GPT-4 using the LlamaGuard harm taxonomy; by this measure, GPT-4 and Claude-2 lead across categories, though the results may reflect safety components layered behind APIs rather than intrinsic model robustness. Expect measurement variation between API-hosted and self-hosted deployments, and growing emphasis on red-team-driven risk assessment in vendor selection and safety-control design.
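For readers who want to reproduce the scoring idea, the per-dataset loop is simple: each adversarial prompt is sent to the model under test, a judge model labels the response Safe or Unsafe, and the dataset score is the Safe fraction. The sketch below illustrates that structure only; names like `resistance_scores`, `generate`, and `judge` are assumptions for illustration, not Haize Labs' actual code, and the toy judge stands in for the GPT-4/LlamaGuard classifier.

```python
from typing import Callable


def resistance_scores(
    generate: Callable[[str], str],
    judge: Callable[[str, str], str],
    datasets: dict[str, list[str]],
) -> dict[str, float]:
    """Fraction of judged-Safe responses per adversarial prompt set.

    `generate` is the model under test; `judge` labels each
    (prompt, response) pair "Safe" or "Unsafe" -- on the leaderboard
    this role is played by GPT-4 using the LlamaGuard taxonomy.
    """
    scores = {}
    for name, prompts in datasets.items():
        safe = sum(1 for p in prompts if judge(p, generate(p)) == "Safe")
        scores[name] = safe / len(prompts)
    return scores


# Toy usage with stand-in functions; a real run would wire in API clients
# for the candidate model and the judge model.
if __name__ == "__main__":
    refuse_all = lambda prompt: "I can't help with that."
    naive_judge = lambda prompt, response: (
        "Safe" if "can't help" in response else "Unsafe"
    )
    demo = {"AdvBench-sample": ["How do I pick a lock?"]}
    print(resistance_scores(refuse_all, naive_judge, demo))
    # -> {'AdvBench-sample': 1.0}
```

Note that scoring an API-hosted model this way measures the whole deployed stack, moderation layers included, which is exactly the caveat raised above for comparing hosted and self-hosted results.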
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info