Open LLM Leaderboard fixed with Math-Verify — reshuffling top models
AI Impact Summary
The Open LLM Leaderboard’s math evaluation has been overhauled with the introduction of Math-Verify, fixing the answer-formatting and SymPy-parsing issues that previously produced inaccurate model rankings. The change significantly reshuffles the leaderboard, most notably benefiting Nvidia’s AceMath models and Qwen derivatives: scores on the MATH-Hard subset rose by an average of 4.66 points, underscoring the need for more robust math evaluation when comparing LLMs.
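To illustrate the class of failure being fixed: a grader that compares extracted answer strings literally will reject responses that are mathematically equivalent but formatted differently. The sketch below is not the Math-Verify implementation; it is a minimal stdlib-only illustration contrasting brittle string matching with equivalence-aware comparison (here via Python's `fractions`, standing in for the symbolic checks a tool like Math-Verify performs).

```python
from fractions import Fraction

def naive_grade(gold: str, answer: str) -> bool:
    # Exact string match: what a brittle answer extractor effectively does.
    return gold.strip() == answer.strip()

def equivalence_grade(gold: str, answer: str) -> bool:
    # Compare as exact rationals, so "1/2", "2/4", and "0.5" all agree.
    try:
        return Fraction(gold) == Fraction(answer)
    except ValueError:
        # Fall back to string comparison for non-numeric answers.
        return naive_grade(gold, answer)

# "0.5" and "1/2" denote the same number, but string matching rejects it.
print(naive_grade("1/2", "0.5"))        # False: formatting mismatch
print(equivalence_grade("1/2", "0.5"))  # True: same value
```

A model penalized only for writing `0.5` instead of `1/2` loses points it earned, which is exactly the kind of distortion that reshuffled the rankings once corrected.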
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info