3LM: Arabic LLM Benchmark for STEM and Code
AI Impact Summary
The 3LM benchmark introduces a critical new resource for evaluating Arabic Large Language Models on STEM and code generation. Its three components (Native STEM, Synthetic STEM, and Arabic Code) address a significant gap in existing evaluations by targeting structured reasoning and formal logic in Arabic. The benchmark's construction pipeline, which combines OCR, LLM-assisted extraction, and human review, demonstrates a rigorous approach to data quality and representation, offering a valuable tool for advancing Arabic NLP research and development.
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info