Hugging Face Open LLM Leaderboard re-evaluated with Math-Verify — rankings reshuffled
AI Impact Summary
The Open LLM Leaderboard used Math-Verify to re-evaluate all 3,751 models on the MATH-Hard task, delivering a fairer, more robust comparison. The rollout fixed multiple answer parsing and extraction issues (including handling of boxed answers and matrix/set representations) and produced significant score shifts: models solved 61 more problems on average, a mean increase of 4.66 points, with the largest gains in algebra-related subsets. The top rankings moved notably, with AceMath taking the lead and Qwen derivatives rising; DeepSeek's gains were driven by better extraction of boxed notation. The result reshuffled the overall leaderboard and challenged prior performance assumptions.
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info