BigCodeBench: LLM Programming Benchmark Reveals Limited Model Performance
AI Impact Summary
BigCodeBench is a new benchmark designed to assess LLMs' programming capabilities more accurately than existing benchmarks such as HumanEval. Its complexity, comprising 1,140 function-level tasks that span diverse library calls and carry an average of 5.6 test cases per task, aims to better reflect real-world software development challenges. Initial results, however, show that even state-of-the-art models like GPT-4o struggle with the benchmark, particularly when handling complex instructions and accounting for missing setup components, highlighting the need for further improvements in LLM reasoning and instruction following. A sketch of the task-plus-tests shape appears below.
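For illustration, a BigCodeBench-style task pairs a function stub whose docstring demands specific library usage with unit tests that a model's completion must pass. The following is a minimal hypothetical sketch in that shape; the function name, docstring, and tests are invented for this example and are not drawn from the benchmark itself:

```python
import unittest
from collections import Counter

# Hypothetical task in the BigCodeBench style: a function whose
# docstring requires a specific library call (here, collections.Counter).
def task_func(words):
    """Return the most common word in `words` and its count.

    Requirements: collections.Counter
    """
    counts = Counter(words)
    return counts.most_common(1)[0]

# Each benchmark task ships with unit tests; a model's completion
# counts as correct only if every test case passes.
class TestTaskFunc(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(task_func(["a", "b", "a"]), ("a", 2))

    def test_tie_keeps_first_seen(self):
        # Counter.most_common preserves first-seen order on ties.
        self.assertEqual(task_func(["x", "y"]), ("x", 1))

if __name__ == "__main__":
    unittest.main()
```

Evaluating against executable tests rather than string matching is what lets the benchmark penalize completions that look plausible but omit required imports or setup.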
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info