Arabic Leaderboards update: AraGen-03-25 improvements, Arabic Instruction Following, MBZUAI collaboration
AI Impact Summary
Arabic Leaderboards now centralize Arabic evaluations under the MBZUAI-backed Arabic-Leaderboards Space, which hosts two live leaderboards: AraGen-03-25 and Arabic Instruction Following. The AraGen-03-25 release grows the dataset to 340 question-answer items spanning question answering, reasoning, safety, and orthographic tasks, and ships with a refined judge prompt and a three-month window during which the test set stays private to keep comparisons fair. Because the evaluation is dynamic, rankings shift when the dataset and prompts change (notably for gpt-4o-2024-08-06 and Claude variants), while the top performer, o1-2024-12-17, keeps its lead; this suggests that results are sensitive to prompt design and dataset version. The expansion raises the bar for Arabic LLM benchmarking: teams will need to re-baseline models against the updated benchmarks and may need to re-tune systems to score well on Arabic Instruction Following.
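One way to re-baseline after a benchmark update is to compare each model's rank before and after the change. The sketch below is a minimal illustration and not part of the leaderboard's own tooling: the function names `rank_models` and `rank_shifts`, the model names, and every score are hypothetical placeholders, not real leaderboard results.

```python
from typing import Dict


def rank_models(scores: Dict[str, float]) -> Dict[str, int]:
    """Return 1-based ranks, highest score first; ties are broken by model name."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {name: position for position, (name, _) in enumerate(ordered, start=1)}


def rank_shifts(old_scores: Dict[str, float], new_scores: Dict[str, float]) -> Dict[str, int]:
    """Compute old_rank - new_rank for models present in both releases.

    A positive value means the model moved up; a negative value means it moved down.
    """
    shared = old_scores.keys() & new_scores.keys()
    old_ranks = rank_models({m: old_scores[m] for m in shared})
    new_ranks = rank_models({m: new_scores[m] for m in shared})
    return {m: old_ranks[m] - new_ranks[m] for m in sorted(shared)}


# Hypothetical placeholder scores for two benchmark releases (NOT real results).
previous_release = {"model-a": 0.81, "model-b": 0.74, "model-c": 0.70}
current_release = {"model-a": 0.83, "model-b": 0.66, "model-c": 0.72}

shifts = rank_shifts(previous_release, current_release)
for model, shift in shifts.items():
    print(f"{model}: {shift:+d}")
```

Only models scored on both releases are compared, so an entry added or dropped between versions does not distort the shifts of the others.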
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info