Open LLM Leaderboard: DROP benchmark scoring issues require new evaluation harness
AI Impact Summary
The Open LLM Leaderboard’s DROP benchmark is producing unreliable scores due to a flawed normalization step in the evaluation harness. The harness mishandles floating-point numbers followed by whitespace, truncating otherwise-correct answers and yielding drastically low F1 scores. This exposes a critical weakness in the benchmark’s scoring pipeline and necessitates a replacement evaluation harness.
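A minimal sketch of the failure mode, assuming a hypothetical normalization that truncates an answer at the first `.` (this is an illustration, not the leaderboard’s actual harness code): a floating-point answer such as `12.25` is cut down to `12`, which then shares no token with the gold answer, so the token-level F1 collapses to zero.

```python
from collections import Counter


def flawed_normalize(answer: str) -> str:
    # Hypothetical flawed step: everything after the first '.' is
    # dropped, so the floating-point answer "12.25" becomes "12".
    return answer.split(".")[0].strip().lower()


def token_f1(pred: str, gold: str) -> float:
    # Standard token-overlap F1, as used by DROP-style scorers.
    p, g = Counter(pred.split()), Counter(gold.split())
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)


print(flawed_normalize("12.25"))                       # "12"
print(token_f1(flawed_normalize("12.25"), "12.25"))    # 0.0
print(token_f1("12.25", "12.25"))                      # 1.0
```

Even a model that answers every numeric question exactly right would score near zero under this normalization, which is consistent with the drastically low F1 scores observed.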
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info