3LM Benchmark: Arabic LLMs in STEM and Code Evaluation
AI Impact Summary
3LM provides a dedicated Arabic-language benchmark suite for STEM reasoning and code generation, filling a gap left by English-centric benchmarks. It assembles three datasets (Native STEM MCQs, Synthetic STEM MCQs, and Arabic Code Benchmarks) and combines cross-domain data generation with validation: OCR-to-LaTeX extraction via Pix2Tex, GPT-4o translation with backtranslation checks, and EvalPlus scoring for code. This enables rigorous, repeatable evaluation of Arabic LLMs on structured reasoning and programming tasks, with clear signals for where models lag and which prompts or training data may bridge the gaps. Integration with HuggingFace Datasets/Transformers and open-source tooling makes the suite easy to embed in model evaluation pipelines and CI, supporting progress tracking across models such as Qwen2.5-72B-Instruct, Gemma-3-27B, and GPT-4o.
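As a minimal sketch of how an MCQ benchmark like this slots into an evaluation pipeline, the snippet below computes exact-match accuracy over model-chosen answer letters. The field name `answer`, the helper name `score_mcq`, and the commented-out HuggingFace repo id are all assumptions for illustration, not the suite's actual schema or API.

```python
from typing import Dict, List


def score_mcq(items: List[Dict[str, str]], predictions: List[str]) -> float:
    """Exact-match accuracy for multiple-choice questions.

    `items` carries a gold answer letter under the (assumed) key "answer";
    `predictions` holds one model-chosen letter per item.
    """
    if len(items) != len(predictions):
        raise ValueError("one prediction per item is required")
    if not items:
        return 0.0
    correct = sum(
        pred.strip().upper() == item["answer"].strip().upper()
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)


# In a real pipeline the items would come from the hub, e.g.:
# from datasets import load_dataset
# items = load_dataset("tiiuae/3LM", split="test")  # repo id is an assumption

sample = [
    {"question": "2 + 2 = ?", "answer": "B"},
    {"question": "3 * 3 = ?", "answer": "C"},
]
print(score_mcq(sample, ["b", "A"]))  # → 0.5
```

Case-insensitive comparison keeps the scorer robust to models that answer in lowercase; in CI, the returned accuracy can be asserted against a per-model regression threshold.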
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info