Red-Teaming Resistance Leaderboard benchmarks frontier LLM robustness (GPT-4, Claude-2, Vicuna-13B)
AI Impact Summary
Haize Labs' Red-Teaming Resistance Benchmark introduces a formal, cross-dataset score for frontier LLM robustness by challenging models with high-quality adversarial prompts drawn from datasets such as AdvBench, AART, Beavertails, DNA, RedEval-HarmfulQA, and RedEval-DangerousQA. Responses are labeled Safe or Unsafe by GPT-4 using the LlamaGuard harm taxonomy; by this measure, GPT-4 and Claude-2 lead across categories, though the results may reflect safety components layered behind APIs rather than intrinsic model robustness. Expect measurement variation between API-hosted and self-hosted deployments, and growing emphasis on red-team-driven risk assessment in vendor selection and safety-control design.
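For readers who want to reproduce the scoring idea, the per-dataset loop is simple: each adversarial prompt is sent to the model under test, a judge model labels the response Safe or Unsafe, and the dataset score is the Safe fraction. The sketch below illustrates that structure only; names like `resistance_scores`, `generate`, and `judge` are assumptions for illustration, not Haize Labs' actual code, and the toy judge stands in for the GPT-4/LlamaGuard classifier.

```python
from typing import Callable


def resistance_scores(
    generate: Callable[[str], str],
    judge: Callable[[str, str], str],
    datasets: dict[str, list[str]],
) -> dict[str, float]:
    """Fraction of judged-Safe responses per adversarial prompt set.

    `generate` is the model under test; `judge` labels each
    (prompt, response) pair "Safe" or "Unsafe" -- on the leaderboard
    this role is played by GPT-4 using the LlamaGuard taxonomy.
    """
    scores = {}
    for name, prompts in datasets.items():
        safe = sum(1 for p in prompts if judge(p, generate(p)) == "Safe")
        scores[name] = safe / len(prompts)
    return scores


# Toy usage with stand-in functions; a real run would wire in API clients
# for the candidate model and the judge model.
if __name__ == "__main__":
    refuse_all = lambda prompt: "I can't help with that."
    naive_judge = lambda prompt, response: (
        "Safe" if "can't help" in response else "Unsafe"
    )
    demo = {"AdvBench-sample": ["How do I pick a lock?"]}
    print(resistance_scores(refuse_all, naive_judge, demo))
    # -> {'AdvBench-sample': 1.0}
```

Note that scoring an API-hosted model this way measures the whole deployed stack, moderation layers included, which is exactly the caveat raised above for comparing hosted and self-hosted results.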
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info