LiveCodeBench Leaderboard: Contamination-free, time-based evaluation of Code LLMs across four scenarios
AI Impact Summary
LiveCodeBench introduces a holistic, contamination-free benchmark for code LLMs. It tags each problem with its release date on LeetCode, AtCoder, and CodeForces, which enables evaluation over rolling time windows and makes data leakage from training sets detectable. The benchmark assesses four coding scenarios (Code Generation, Self Repair, Code Execution, and Test Output Prediction) using an execution-based correctness metric, Pass@1, giving a more robust view of real-world coding capability than standard benchmarks. For technical teams, it offers a consistent, time-aware scoring framework and comparable results across models (e.g., GPT-4-Turbo, Claude-3-Opus, Mistral-Large), along with evaluation tooling (the LiveCodeBench repository and lcb_runner) to guide model selection and benchmarking strategy.
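To make the time-windowed, execution-based scoring concrete, below is a minimal sketch in Python. It is not the lcb_runner API; the `ProblemResult` fields, problem IDs, dates, and the training-cutoff window are illustrative assumptions. It computes the standard unbiased Pass@1 estimate restricted to problems released inside a chosen date window, which is the contamination-control idea the summary describes: problems released after a model's training cutoff cannot have been memorized.

```python
from dataclasses import dataclass
from datetime import date
from math import comb
from typing import List

@dataclass
class ProblemResult:
    """Per-problem outcome: n generated samples, c of which pass all tests."""
    problem_id: str
    release_date: date   # hypothetical field; LiveCodeBench tags problems with contest release dates
    n_samples: int
    n_correct: int

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k). For k=1 this reduces to c/n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def windowed_pass_at_1(results: List[ProblemResult], start: date, end: date) -> float:
    """Average Pass@1 over problems released inside [start, end].

    Restricting the window to problems released after a model's training cutoff
    is the contamination control: those problems cannot appear in training data.
    """
    window = [r for r in results if start <= r.release_date <= end]
    if not window:
        return float("nan")
    return sum(pass_at_k(r.n_samples, r.n_correct, 1) for r in window) / len(window)

if __name__ == "__main__":
    # Illustrative results only; problem IDs and dates are made up for the example.
    results = [
        ProblemResult("leetcode-3105", date(2024, 4, 14), n_samples=10, n_correct=7),
        ProblemResult("atcoder-abc350-d", date(2024, 4, 20), n_samples=10, n_correct=2),
        ProblemResult("codeforces-1956c", date(2024, 4, 27), n_samples=10, n_correct=0),
    ]
    # Score only problems released after a (hypothetical) training cutoff.
    score = windowed_pass_at_1(results, start=date(2024, 4, 1), end=date(2024, 6, 30))
    print(f"Pass@1 on the post-cutoff window: {score:.3f}")
```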
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info