Hugging Face Open LLM Leaderboard re-evaluated with Math-Verify — rankings reshuffled
AI Impact Summary
The Open LLM Leaderboard used Math-Verify to re-evaluate all 3,751 models on the MATH-Hard task, delivering a fairer, more robust comparison. The rollout fixed multiple answer parsing and extraction issues (including handling of boxed answers and matrix/set representations) and produced significant score shifts: models solved 61 more problems on average, a mean increase of 4.66 points, with the largest gains in algebra-related subsets. The top rankings moved notably, with AceMath taking the lead and Qwen derivatives rising; DeepSeek's gains were driven by better extraction of boxed notation. The result reshuffled the overall leaderboard and challenged prior performance assumptions.
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info