Falcon-Arabic 7B: 32k-context Arabic LLM with SFT and DPO alignment
AI Impact Summary
Falcon-Arabic is a 7B-parameter model built on Falcon 3, expanded with a 32k-token context window and an Arabic-specific tokenizer and embeddings, enabling long-document processing and retrieval-augmented generation. The approach extends the tokenizer with 32,000 Arabic tokens and uses a novel embedding-initialization scheme that bootstraps the new tokens' representations from related existing embeddings, followed by continued pretraining on high-quality native Arabic data. The pipeline then applies supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to improve instruction following and alignment, producing the Falcon-Arabic Instruct variants. On OALL v2 benchmarks, Falcon-Arabic outperforms existing Arabic LLMs in its size class, as well as some models several times larger, signaling strong value for Arabic-first deployments and faster ROI for Arabic-language AI initiatives.
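The release does not spell out the initialization procedure, but a common way to bootstrap new token embeddings from related ones is to warm-start each new token with the mean of its subword embeddings under the original tokenizer. The sketch below illustrates that approach under that assumption; the checkpoint name and the sample Arabic tokens are placeholders, not the actual Falcon-Arabic vocabulary or recipe.

```python
# Sketch, not the published Falcon-Arabic recipe: extend a tokenizer with
# new Arabic tokens and warm-start their embeddings by averaging each
# token's subword embeddings under the original tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "tiiuae/Falcon3-7B-Base"  # assumed base checkpoint (illustrative)
tokenizer = AutoTokenizer.from_pretrained(base)
base_tokenizer = AutoTokenizer.from_pretrained(base)  # pristine copy for decomposition
model = AutoModelForCausalLM.from_pretrained(base)

new_tokens = ["مرحبا", "الذكاء"]  # placeholders, not the real 32,000-token extension
tokenizer.add_tokens(new_tokens)

old_rows = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))
in_emb = model.get_input_embeddings().weight
out_emb = model.get_output_embeddings()

with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        # Decompose with the *original* tokenizer, then average the pieces.
        sub_ids = [i for i in base_tokenizer.encode(tok, add_special_tokens=False)
                   if i < old_rows]
        if not sub_ids:
            continue  # leave the default random init in place
        in_emb[new_id] = in_emb[sub_ids].mean(dim=0)
        # If the output head is untied, warm-start it the same way.
        if out_emb is not None and not model.config.tie_word_embeddings:
            out_emb.weight[new_id] = out_emb.weight[sub_ids].mean(dim=0)
```

Averaging related embeddings gives the new rows a sensible starting point in the model's existing representation space, so continued pretraining converges faster than it would from random initialization.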
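For the alignment stage, the summary names SFT followed by DPO but not the tooling. A minimal preference-optimization pass with the trl library's DPOTrainer might look like the following sketch; the dataset path, checkpoint name, and hyperparameters are all placeholders, not Falcon-Arabic's actual configuration.

```python
# Sketch: DPO preference alignment with trl (recent versions).
# All names and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "tiiuae/Falcon3-7B-Instruct"  # assumed SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A preference dataset with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("json", data_files="arabic_prefs.jsonl", split="train")

config = DPOConfig(
    output_dir="falcon-arabic-dpo",
    beta=0.1,  # strength of the KL penalty toward the reference model
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

When no explicit ref_model is passed, trl clones the policy as a frozen reference, which is the standard DPO setup of penalizing divergence from the SFT starting point.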
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info