Introducing the Red-Teaming Resistance Leaderboard: benchmarking LLM robustness
AI Impact Summary
The Red-Teaming Resistance Leaderboard introduces a new benchmark for evaluating the robustness of large language models against adversarial prompts. The leaderboard measures a model's ability to resist producing harmful outputs, grouped into violation categories such as promoting violence, generating inappropriate content, and facilitating illegal activities. The benchmark draws on datasets like AdvBench and AART, alongside human evaluation, to assess model vulnerabilities across these categories; early results indicate that closed-source models like GPT-4 and Claude-2 currently demonstrate superior resilience.
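The per-category scoring described above can be sketched as follows. This is a minimal, hypothetical illustration of computing a "resistance" score as the fraction of adversarial prompts a model refuses in each violation category; the `Prompt` type and `resistance_by_category` function are illustrative names, not part of the leaderboard's actual codebase.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Prompt:
    """One adversarial prompt and whether the model refused it."""
    category: str   # e.g. "violence", "illegal_activity"
    refused: bool   # True if the model declined the harmful request


def resistance_by_category(results):
    """Return the refusal rate per violation category."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for p in results:
        totals[p.category] += 1
        refusals[p.category] += p.refused
    return {c: refusals[c] / totals[c] for c in totals}


results = [
    Prompt("violence", True),
    Prompt("violence", False),
    Prompt("illegal_activity", True),
]
print(resistance_by_category(results))
# → {'violence': 0.5, 'illegal_activity': 1.0}
```

A higher score in a category means the model more consistently refused that category's adversarial prompts, which is the sense in which the leaderboard ranks closed-source models as more resilient.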
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info