Open CoT Leaderboard measures chain-of-thought impact on LLM accuracy
AI Impact Summary
The Open CoT Leaderboard introduces a standardized way to compare LLMs on chain-of-thought (CoT) generation by measuring the delta between a model's accuracy with and without CoT prompting on AGIEval-derived logikon-bench tasks. CoT generation is modular, implemented via two prompting strategies (Classic and Reflect), and baseline accuracy on the multiple-choice tasks is scored via loglikelihood, with models such as Mixtral-8x7B-Instruct-v0.1 illustrating current capabilities. This gives engineering teams concrete, task-level insight into which models and prompting approaches yield tangible CoT benefits, informing model selection, LangChain/LMQL integration for reasoning pipelines, and long-term strategy for mitigating data-contamination concerns in benchmarking.
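To make the metric concrete, the sketch below shows loglikelihood-based multiple-choice scoring and the resulting CoT accuracy delta. This is a minimal illustration, not the leaderboard's own harness: the model choice (gpt2) and the helper names (option_loglikelihood, predict, accuracy) are assumptions introduced here, and boundary tokenization between context and option is simplified.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-in model; the leaderboard evaluates much larger LLMs.
MODEL = "gpt2"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

@torch.no_grad()
def option_loglikelihood(context: str, option: str) -> float:
    """Sum of log-probs of the option's tokens, conditioned on context."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + option, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    # Position i predicts token i+1, so drop the final logit row.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    opt_len = full_ids.shape[1] - ctx_len       # simplified boundary handling
    targets = full_ids[:, -opt_len:]            # the option's token ids
    picked = logprobs[:, -opt_len:].gather(2, targets.unsqueeze(-1))
    return picked.sum().item()

def predict(question: str, options: list[str], cot: str = "") -> int:
    """Argmax-loglikelihood answer choice; empty cot = baseline condition."""
    context = f"{question}\n{cot}\nAnswer: "
    scores = [option_loglikelihood(context, o) for o in options]
    return max(range(len(options)), key=scores.__getitem__)

def accuracy(examples: list[dict], traces: list[str] | None = None) -> float:
    """Fraction answered correctly, with or without per-example CoT traces."""
    hits = sum(
        predict(ex["question"], ex["options"], traces[i] if traces else "")
        == ex["label"]
        for i, ex in enumerate(examples)
    )
    return hits / len(examples)

# The leaderboard metric: accuracy gain attributable to CoT.
# delta = accuracy(task_examples, cot_traces) - accuracy(task_examples)
```

One likely rationale for loglikelihood scoring of the answer options, rather than free-form answer generation, is that it keeps the baseline comparable across models with very different instruction-following ability.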
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info