Open CoT Leaderboard measures chain-of-thought impact on LLM accuracy
AI Impact Summary
The Open CoT Leaderboard introduces a standardized way to compare LLMs on chain-of-thought (CoT) generation by measuring the delta between a model's accuracy with and without CoT prompting on AGIEval-derived logikon-bench tasks. CoT generation is modular, implemented via two prompting strategies (Classic and Reflect), and baseline accuracy on the multiple-choice tasks is scored via loglikelihood, with models such as Mixtral-8x7B-Instruct-v0.1 illustrating current capabilities. This gives engineering teams concrete, task-level insight into which models and prompting approaches yield tangible CoT benefits, informing model selection, LangChain/LMQL integration for reasoning pipelines, and long-term strategy for mitigating data-contamination concerns in benchmarking.
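To make the metric concrete, the sketch below shows loglikelihood-based multiple-choice scoring and the resulting CoT accuracy delta. This is a minimal illustration, not the leaderboard's own harness: the model choice (gpt2) and the helper names (option_loglikelihood, predict, accuracy) are assumptions introduced here, and boundary tokenization between context and option is simplified.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-in model; the leaderboard evaluates much larger LLMs.
MODEL = "gpt2"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

@torch.no_grad()
def option_loglikelihood(context: str, option: str) -> float:
    """Sum of log-probs of the option's tokens, conditioned on context."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + option, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    # Position i predicts token i+1, so drop the final logit row.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    opt_len = full_ids.shape[1] - ctx_len       # simplified boundary handling
    targets = full_ids[:, -opt_len:]            # the option's token ids
    picked = logprobs[:, -opt_len:].gather(2, targets.unsqueeze(-1))
    return picked.sum().item()

def predict(question: str, options: list[str], cot: str = "") -> int:
    """Argmax-loglikelihood answer choice; empty cot = baseline condition."""
    context = f"{question}\n{cot}\nAnswer: "
    scores = [option_loglikelihood(context, o) for o in options]
    return max(range(len(options)), key=scores.__getitem__)

def accuracy(examples: list[dict], traces: list[str] | None = None) -> float:
    """Fraction answered correctly, with or without per-example CoT traces."""
    hits = sum(
        predict(ex["question"], ex["options"], traces[i] if traces else "")
        == ex["label"]
        for i, ex in enumerate(examples)
    )
    return hits / len(examples)

# The leaderboard metric: accuracy gain attributable to CoT.
# delta = accuracy(task_examples, cot_traces) - accuracy(task_examples)
```

One likely rationale for loglikelihood scoring of the answer options, rather than free-form answer generation, is that it keeps the baseline comparable across models with very different instruction-following ability.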
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info