3LM Benchmark: Arabic STEM and Code Evaluation for LLMs
AI Impact Summary
3LM is a new Arabic-focused benchmark suite that measures STEM reasoning and code generation across three tracks: native Arabic content, synthetically generated problems, and code tasks translated into Arabic. The construction pipeline combines OCR-based math parsing (Pix2Tex) with LLM-assisted question generation (YourBench); code tasks are evaluated with EvalPlus test suites and scored with pass@1. Results highlight both the impact of language-specific prompt quality and correlations between Arabic and English performance. For engineering teams, 3LM enables reproducible cross-model comparison of Arabic STEM and coding capability, can inform roadmap prioritization of Arabic-language reasoning and code tooling, and fits into existing evaluation stacks (HuggingFace tooling, OpenAI/HF APIs).
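For context on the pass@1 scoring mentioned above: a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021), which is the conventional metric behind EvalPlus-style code evaluation. The function name and example counts here are illustrative, not taken from the 3LM paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    Probability that at least one of k samples, drawn without
    replacement from n generations of which c pass the tests,
    is correct: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failing generations: every k-subset
        # contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 generations per task, 3 pass EvalPlus tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```

At k = 1 the estimator reduces to the empirical pass rate c/n, so pass@1 scores are directly comparable across models provided the sampling settings (temperature, number of generations) match.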
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info