Open LLM Leaderboard re-evaluated with Math-Verify; 3,751 models reassessed and rankings reshuffled
AI Impact Summary
The Open LLM Leaderboard has been re-evaluated using Math-Verify, covering all 3,751 models submitted to date, to ensure fairer math benchmarking. Math-Verify fixes answer-parsing and symbolic-extraction gaps (e.g., boxed/LaTeX answers) that previously caused models' math ability to be underestimated, yielding more accurate comparisons on the MATH-Hard task. On average, models solved 61 more problems and gained 4.66 points, with algebra-related subsets seeing the largest gains (Algebra +8.27, Prealgebra +6.93); some models improved by nearly 90 points. The changes disproportionately benefited Qwen derivatives and DeepSeek models, reshuffling the Top 20, which Nvidia's AceMath models now dominate.
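For illustration, here is a minimal sketch of the kind of answer checking involved, assuming the math_verify package's parse/verify API as documented in its README; naive string comparison marks answers wrong when they are symbolically equivalent but formatted differently, which is the underestimation the re-evaluation corrected.

```python
# Minimal sketch of robust math-answer grading with math_verify,
# assuming its parse() / verify() functions as shown in the README.
from math_verify import parse, verify

# Gold answer and model answer differ only in formatting:
# exact-string matching would score the model as wrong.
gold = parse("$\\boxed{\\frac{1}{2}}$")  # LaTeX, boxed answer
answer = parse("$0.5$")                  # plain decimal

# verify() checks symbolic equivalence (1/2 == 0.5) instead of
# comparing raw strings, so equivalent answers are credited.
print(verify(gold, answer))  # expected: True
```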
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info