BigCodeBench: LLM Programming Benchmark Reveals Limited Model Performance
AI Impact Summary
BigCodeBench is a new benchmark designed to assess LLMs' programming capabilities more accurately than existing benchmarks such as HumanEval. Its complexity, comprising 1,140 function-level tasks that span diverse library calls and carry an average of 5.6 test cases per task, aims to better reflect real-world software development challenges. Initial results, however, show that even state-of-the-art models like GPT-4o struggle with the benchmark, particularly when handling complex instructions and accounting for missing setup components, highlighting the need for further improvements in LLM reasoning and instruction following. A sketch of the task-plus-tests shape appears below.
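For illustration, a BigCodeBench-style task pairs a function stub whose docstring demands specific library usage with unit tests that a model's completion must pass. The following is a minimal hypothetical sketch in that shape; the function name, docstring, and tests are invented for this example and are not drawn from the benchmark itself:

```python
import unittest
from collections import Counter

# Hypothetical task in the BigCodeBench style: a function whose
# docstring requires a specific library call (here, collections.Counter).
def task_func(words):
    """Return the most common word in `words` and its count.

    Requirements: collections.Counter
    """
    counts = Counter(words)
    return counts.most_common(1)[0]

# Each benchmark task ships with unit tests; a model's completion
# counts as correct only if every test case passes.
class TestTaskFunc(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(task_func(["a", "b", "a"]), ("a", 2))

    def test_tie_keeps_first_seen(self):
        # Counter.most_common preserves first-seen order on ties.
        self.assertEqual(task_func(["x", "y"]), ("x", 1))

if __name__ == "__main__":
    unittest.main()
```

Evaluating against executable tests rather than string matching is what lets the benchmark penalize completions that look plausible but omit required imports or setup.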
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info