3LM Benchmark: Arabic STEM and Code Evaluation for LLMs
AI Impact Summary
3LM is a new Arabic-focused benchmark suite that measures STEM reasoning and code generation across three tracks: native Arabic content, synthetically generated problems, and code tasks translated into Arabic. The construction pipeline combines OCR-based math parsing (Pix2Tex) with LLM-assisted question generation (YourBench); code tasks are evaluated with EvalPlus test suites and scored with pass@1. Results highlight both the impact of language-specific prompt quality and correlations between Arabic and English performance. For engineering teams, 3LM enables reproducible cross-model comparison of Arabic STEM and coding capability, can inform roadmap prioritization of Arabic-language reasoning and code tooling, and fits into existing evaluation stacks (HuggingFace tooling, OpenAI/HF APIs).
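For context on the pass@1 scoring mentioned above: a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021), which is the conventional metric behind EvalPlus-style code evaluation. The function name and example counts here are illustrative, not taken from the 3LM paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    Probability that at least one of k samples, drawn without
    replacement from n generations of which c pass the tests,
    is correct: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failing generations: every k-subset
        # contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 generations per task, 3 pass EvalPlus tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```

At k = 1 the estimator reduces to the empirical pass rate c/n, so pass@1 scores are directly comparable across models provided the sampling settings (temperature, number of generations) match.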
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info