3LM Benchmark: Arabic LLMs in STEM and Code Evaluation
AI Impact Summary
3LM provides a dedicated Arabic-language benchmark suite for STEM reasoning and code generation, filling a gap left by English-centric benchmarks. It assembles three datasets (Native STEM MCQs, Synthetic STEM MCQs, and Arabic Code Benchmarks) and combines cross-domain data generation with validation: OCR-to-LaTeX extraction via Pix2Tex, GPT-4o translation with backtranslation checks, and EvalPlus scoring for code. This enables rigorous, repeatable evaluation of Arabic LLMs on structured reasoning and programming tasks, with clear signals for where models lag and which prompts or training data may bridge the gaps. Integration with HuggingFace Datasets/Transformers and open-source tooling makes the suite easy to embed in model evaluation pipelines and CI, supporting progress tracking across models such as Qwen2.5-72B-Instruct, Gemma-3-27B, and GPT-4o.
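As a minimal sketch of how an MCQ benchmark like this slots into an evaluation pipeline, the snippet below computes exact-match accuracy over model-chosen answer letters. The field name `answer`, the helper name `score_mcq`, and the commented-out HuggingFace repo id are all assumptions for illustration, not the suite's actual schema or API.

```python
from typing import Dict, List


def score_mcq(items: List[Dict[str, str]], predictions: List[str]) -> float:
    """Exact-match accuracy for multiple-choice questions.

    `items` carries a gold answer letter under the (assumed) key "answer";
    `predictions` holds one model-chosen letter per item.
    """
    if len(items) != len(predictions):
        raise ValueError("one prediction per item is required")
    if not items:
        return 0.0
    correct = sum(
        pred.strip().upper() == item["answer"].strip().upper()
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)


# In a real pipeline the items would come from the hub, e.g.:
# from datasets import load_dataset
# items = load_dataset("tiiuae/3LM", split="test")  # repo id is an assumption

sample = [
    {"question": "2 + 2 = ?", "answer": "B"},
    {"question": "3 * 3 = ?", "answer": "C"},
]
print(score_mcq(sample, ["b", "A"]))  # → 0.5
```

Case-insensitive comparison keeps the scorer robust to models that answer in lowercase; in CI, the returned accuracy can be asserted against a per-model regression threshold.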
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info