Open LLM Leaderboard fixed with Math-Verify — reshuffling top models
AI Impact Summary
The Open LLM Leaderboard’s math evaluation has been overhauled with the introduction of Math-Verify, fixing the answer-formatting and SymPy-parsing issues that previously produced inaccurate model rankings. The change significantly reshuffles the leaderboard, most notably benefiting Nvidia’s AceMath models and Qwen derivatives: scores on the MATH-Hard subset rose by an average of 4.66 points, underscoring the need for more robust math evaluation when comparing LLMs.
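To illustrate the class of failure being fixed: a grader that compares extracted answer strings literally will reject responses that are mathematically equivalent but formatted differently. The sketch below is not the Math-Verify implementation; it is a minimal stdlib-only illustration contrasting brittle string matching with equivalence-aware comparison (here via Python's `fractions`, standing in for the symbolic checks a tool like Math-Verify performs).

```python
from fractions import Fraction

def naive_grade(gold: str, answer: str) -> bool:
    # Exact string match: what a brittle answer extractor effectively does.
    return gold.strip() == answer.strip()

def equivalence_grade(gold: str, answer: str) -> bool:
    # Compare as exact rationals, so "1/2", "2/4", and "0.5" all agree.
    try:
        return Fraction(gold) == Fraction(answer)
    except ValueError:
        # Fall back to string comparison for non-numeric answers.
        return naive_grade(gold, answer)

# "0.5" and "1/2" denote the same number, but string matching rejects it.
print(naive_grade("1/2", "0.5"))        # False: formatting mismatch
print(equivalence_grade("1/2", "0.5"))  # True: same value
```

A model penalized only for writing `0.5` instead of `1/2` loses points it earned, which is exactly the kind of distortion that reshuffled the rankings once corrected.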
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info