BigCodeBench: Next-gen programming benchmark for LLMs with 1,140 tasks across 139 libraries
AI Impact Summary
BigCodeBench introduces a broad, practice-oriented programming benchmark of 1,140 function-level tasks that require chaining function calls across 139 libraries and handling diverse inputs, outputs, and error cases, with each task verified by unit tests. It emphasizes contamination resistance and rigorous evaluation, using calibrated Pass@1 scores and Elo-based rankings to gauge generalization across models such as GPT-4o. The project provides a PyPI-based evaluation framework and public leaderboards on Hugging Face Spaces, so teams can fold standardized benchmarking into model development and release cycles.
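The calibrated Pass@1 scores build on the standard unbiased pass@k estimator introduced with HumanEval; the leaderboard applies additional per-model calibration on top of it, which is omitted here. The sketch below computes plain Pass@1 over toy per-task sample counts (all numbers are illustrative, not BigCodeBench results):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples passes, given c correct out of n generated samples."""
    if n - c < k:
        return 1.0  # every size-k draw contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy per-task results as (samples generated, samples passing tests).
results = [(10, 3), (10, 0), (10, 10)]
pass_at_1 = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
print(f"Pass@1 = {pass_at_1:.3f}")  # mean over tasks -> 0.433
```

With greedy decoding, as used for the leaderboard's Pass@1, n = k = 1 per task and the estimator reduces to the fraction of tasks whose single generation passes all tests.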
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info