Judge Arena benchmarks LLMs as evaluators with hourly Elo leaderboard
AI Impact Summary
Judge Arena provides a competitive framework for comparing LLMs as evaluators: two candidate judges are run on the same test samples, their scores and critiques are captured, and users vote on which judge's evaluation best aligns with their own judgment. An Elo-based leaderboard updates hourly across 18 models from OpenAI, Anthropic, Meta, Alibaba, Google, and Mistral, with model names hidden until after voting to minimize bias. The platform helps teams select the judge models that best reflect human preferences for their evaluation pipelines, while anonymized data sharing is offered to accelerate research. Stakeholders should plan for integration with CI/test workflows and guard against potential evaluation bias or leakage between rounds.
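As a rough illustration of how pairwise votes can drive an hourly leaderboard, the sketch below applies a standard Elo update to a single vote between two anonymized judges. The K-factor of 32, the starting rating of 1500, and the tie handling are assumptions for illustration; the summary does not specify Judge Arena's actual rating parameters.

```python
# Minimal sketch of an Elo update driven by a pairwise judge vote.
# Assumptions (not from the source): standard Elo formula, K = 32,
# starting ratings of 1500, ties scored as 0.5.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that judge A is preferred over judge B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two anonymized judges start at 1500; the user prefers judge A.
ratings = {"judge_a": 1500.0, "judge_b": 1500.0}
ratings["judge_a"], ratings["judge_b"] = update_elo(
    ratings["judge_a"], ratings["judge_b"], score_a=1.0
)
print(ratings)  # judge_a gains ~16 points, judge_b loses ~16
```

In a leaderboard setting, such updates would be accumulated from all votes and re-published on the hourly refresh cycle.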
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info