AraGen Benchmark and Leaderboard: 3C3H-based Arabic LLM Evaluation Framework
AI Impact Summary
AraGen introduces a dynamic benchmark and leaderboard for Arabic LLMs built on the 3C3H evaluation framework, which combines factual accuracy and usability scoring via an LLM-as-a-Judge. The evaluation pipeline runs three-month blind testing cycles; the private datasets and evaluation code are released only after each cycle ends, which helps prevent data leakage and keeps the benchmark current. The framework scores six dimensions (Correctness, Completeness, Conciseness, Helpfulness, Honesty, Harmlessness) and is paired with a Task Leaderboard (question answering, reasoning, orthographic analysis, safety), pushing teams to optimize Arabic models for both knowledge and user-aligned behavior.
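To illustrate how a multi-dimension judge score might collapse into a single leaderboard number, here is a minimal sketch. It assumes Correctness and Completeness are binary gates while the other four dimensions are rated 1-5 and normalized; the actual weighting and scales are defined by the AraGen 3C3H framework, and the names below are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class JudgeRating:
    """Hypothetical per-response ratings for the six 3C3H dimensions."""
    correctness: int   # assumed binary: 0 or 1
    completeness: int  # assumed binary: 0 or 1
    conciseness: int   # assumed 1..5 scale
    helpfulness: int   # assumed 1..5 scale
    honesty: int       # assumed 1..5 scale
    harmlessness: int  # assumed 1..5 scale


def normalize(score: int) -> float:
    """Map a 1-5 rating onto [0, 1]."""
    return (score - 1) / 4


def score_3c3h(r: JudgeRating) -> float:
    """Average all six dimensions into a single [0, 1] score.

    Equal weighting is an assumption for this sketch, not the
    framework's published formula.
    """
    dims = [
        float(r.correctness),
        float(r.completeness),
        normalize(r.conciseness),
        normalize(r.helpfulness),
        normalize(r.honesty),
        normalize(r.harmlessness),
    ]
    return sum(dims) / len(dims)


rating = JudgeRating(1, 1, 4, 5, 5, 4)
print(round(score_3c3h(rating), 3))  # → 0.917
```

A real pipeline would aggregate such per-response scores over the benchmark's private test set before ranking models on the leaderboard.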
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info