Open LLM Leaderboard: DROP benchmark scoring issues require new evaluation harness
AI Impact Summary
The Open LLM Leaderboard’s DROP benchmark is producing unreliable scores due to a flawed normalization step in the evaluation harness. The harness mishandles floating-point numbers followed by whitespace, truncating otherwise-correct answers and yielding drastically low F1 scores. This exposes a critical weakness in the benchmark’s scoring pipeline and necessitates a replacement evaluation harness.
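A minimal sketch of the failure mode, assuming a hypothetical normalization that truncates an answer at the first `.` (this is an illustration, not the leaderboard’s actual harness code): a floating-point answer such as `12.25` is cut down to `12`, which then shares no token with the gold answer, so the token-level F1 collapses to zero.

```python
from collections import Counter


def flawed_normalize(answer: str) -> str:
    # Hypothetical flawed step: everything after the first '.' is
    # dropped, so the floating-point answer "12.25" becomes "12".
    return answer.split(".")[0].strip().lower()


def token_f1(pred: str, gold: str) -> float:
    # Standard token-overlap F1, as used by DROP-style scorers.
    p, g = Counter(pred.split()), Counter(gold.split())
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)


print(flawed_normalize("12.25"))                       # "12"
print(token_f1(flawed_normalize("12.25"), "12.25"))    # 0.0
print(token_f1("12.25", "12.25"))                      # 1.0
```

Even a model that answers every numeric question exactly right would score near zero under this normalization, which is consistent with the drastically low F1 scores observed.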
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info