Judge Arena benchmarks LLMs as evaluators with hourly Elo leaderboard
AI Impact Summary
Judge Arena provides a competitive framework for comparing LLMs as evaluators: two candidate judges are run on the same test samples, their scores and critiques are captured, and users vote on which judge's evaluation best aligns with their own judgment. An Elo-based leaderboard updates hourly across 18 models from OpenAI, Anthropic, Meta, Alibaba, Google, and Mistral, with model names hidden until after voting to minimize bias. The platform helps teams select the judge models that best reflect human preferences for their evaluation pipelines, while anonymized data sharing is offered to accelerate research. Stakeholders should plan for integration with CI/test workflows and guard against potential evaluation bias or leakage between rounds.
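As a rough illustration of how pairwise votes can drive an hourly leaderboard, the sketch below applies a standard Elo update to a single vote between two anonymized judges. The K-factor of 32, the starting rating of 1500, and the tie handling are assumptions for illustration; the summary does not specify Judge Arena's actual rating parameters.

```python
# Minimal sketch of an Elo update driven by a pairwise judge vote.
# Assumptions (not from the source): standard Elo formula, K = 32,
# starting ratings of 1500, ties scored as 0.5.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that judge A is preferred over judge B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two anonymized judges start at 1500; the user prefers judge A.
ratings = {"judge_a": 1500.0, "judge_b": 1500.0}
ratings["judge_a"], ratings["judge_b"] = update_elo(
    ratings["judge_a"], ratings["judge_b"], score_a=1.0
)
print(ratings)  # judge_a gains ~16 points, judge_b loses ~16
```

In a leaderboard setting, such updates would be accumulated from all votes and re-published on the hourly refresh cycle.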
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info