Open Chain of Thought Leaderboard introduced — measuring reasoning trace quality
AI Impact Summary
Open Chain of Thought is introducing a new leaderboard that evaluates how well LLMs generate effective chain-of-thought (CoT) reasoning traces. This is a significant shift from traditional accuracy benchmarks: instead of ranking models by raw answer correctness, the leaderboard scores each model by Δ, the difference between its accuracy with CoT and its accuracy without CoT. The design aims to provide a more robust assessment of LLM capabilities, particularly on challenging reasoning tasks, and to be more resistant to training data contamination.
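The Δ metric described above can be illustrated with a minimal sketch. It assumes exact-match scoring on multiple-choice answers. The function names (`accuracy`, `accuracy_gain`) and the sample data are hypothetical and are not taken from the leaderboard's own code.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def accuracy_gain(
    cot_predictions: Sequence[str],
    baseline_predictions: Sequence[str],
    gold: Sequence[str],
) -> float:
    """Delta = accuracy with CoT minus accuracy without CoT (assumed definition)."""
    return accuracy(cot_predictions, gold) - accuracy(baseline_predictions, gold)


# Hypothetical answers for four multiple-choice questions.
gold = ["A", "B", "C", "D"]
baseline = ["A", "B", "A", "A"]  # answered without a reasoning trace
with_cot = ["A", "B", "C", "A"]  # answered after generating a reasoning trace

delta = accuracy_gain(with_cot, baseline, gold)
print(f"Delta = {delta:+.2f}")
```

A positive Δ means the reasoning trace helped the model answer correctly, a value near zero means it made little difference, and a negative value means it hurt. Because Δ compares a model against its own baseline on the same questions, memorized answers would tend to lift both accuracies alike, which is one plausible reading of the contamination-resistance claim.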
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info