3LM: Arabic LLM Benchmark for STEM and Code
AI Impact Summary
The 3LM benchmark introduces a critical new resource for evaluating Arabic Large Language Models on STEM and code generation. Its three components (Native STEM, Synthetic STEM, and Arabic Code) address a significant gap in existing evaluations by targeting structured reasoning and formal logic in Arabic. The benchmark's construction pipeline, which combines OCR, LLM-assisted extraction, and human review, demonstrates a rigorous approach to data quality and representation, offering a valuable tool for advancing Arabic NLP research and development.
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info