Open LLM Leaderboard re-evaluated with Math-Verify; 3,751 models reassessed and rankings reshuffled
AI Impact Summary
The Open LLM Leaderboard has been re-evaluated using Math-Verify, covering all 3,751 models submitted to date, to ensure fairer math benchmarking. Math-Verify fixes answer-parsing and symbolic-extraction gaps (e.g., boxed/LaTeX answers) that previously caused models' math ability to be underestimated, yielding more accurate comparisons on the MATH-Hard task. On average, models solved 61 more problems and gained 4.66 points, with algebra-related subsets seeing the largest gains (Algebra +8.27, Prealgebra +6.93); some models improved by nearly 90 points. The changes disproportionately benefited Qwen derivatives and DeepSeek models, reshuffling the Top 20, which Nvidia's AceMath models now dominate.
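For illustration, here is a minimal sketch of the kind of answer checking involved, assuming the math_verify package's parse/verify API as documented in its README; naive string comparison marks answers wrong when they are symbolically equivalent but formatted differently, which is the underestimation the re-evaluation corrected.

```python
# Minimal sketch of robust math-answer grading with math_verify,
# assuming its parse() / verify() functions as shown in the README.
from math_verify import parse, verify

# Gold answer and model answer differ only in formatting:
# exact-string matching would score the model as wrong.
gold = parse("$\\boxed{\\frac{1}{2}}$")  # LaTeX, boxed answer
answer = parse("$0.5$")                  # plain decimal

# verify() checks symbolic equivalence (1/2 == 0.5) instead of
# comparing raw strings, so equivalent answers are credited.
print(verify(gold, answer))  # expected: True
```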
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info