Falcon-Arabic 7B: 32k-context Arabic LLM with SFT and DPO alignment
AI Impact Summary
Falcon-Arabic is a 7B-parameter model built on Falcon 3, expanded with a 32k-token context window and an Arabic-specific tokenizer and embeddings, enabling long-document processing and retrieval-augmented generation. The approach extends the tokenizer with 32,000 Arabic tokens and uses a novel embedding-initialization scheme that bootstraps the new tokens' representations from related existing embeddings, followed by continued pretraining on high-quality native Arabic data. The pipeline then applies supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to improve instruction following and alignment, producing the Falcon-Arabic Instruct variants. On OALL v2 benchmarks, Falcon-Arabic outperforms existing Arabic LLMs in its size class, as well as some models several times larger, signaling strong value for Arabic-first deployments and faster ROI for Arabic-language AI initiatives.
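The release does not spell out the initialization procedure, but a common way to bootstrap new token embeddings from related ones is to warm-start each new token with the mean of its subword embeddings under the original tokenizer. The sketch below illustrates that approach under that assumption; the checkpoint name and the sample Arabic tokens are placeholders, not the actual Falcon-Arabic vocabulary or recipe.

```python
# Sketch, not the published Falcon-Arabic recipe: extend a tokenizer with
# new Arabic tokens and warm-start their embeddings by averaging each
# token's subword embeddings under the original tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "tiiuae/Falcon3-7B-Base"  # assumed base checkpoint (illustrative)
tokenizer = AutoTokenizer.from_pretrained(base)
base_tokenizer = AutoTokenizer.from_pretrained(base)  # pristine copy for decomposition
model = AutoModelForCausalLM.from_pretrained(base)

new_tokens = ["مرحبا", "الذكاء"]  # placeholders, not the real 32,000-token extension
tokenizer.add_tokens(new_tokens)

old_rows = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))
in_emb = model.get_input_embeddings().weight
out_emb = model.get_output_embeddings()

with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        # Decompose with the *original* tokenizer, then average the pieces.
        sub_ids = [i for i in base_tokenizer.encode(tok, add_special_tokens=False)
                   if i < old_rows]
        if not sub_ids:
            continue  # leave the default random init in place
        in_emb[new_id] = in_emb[sub_ids].mean(dim=0)
        # If the output head is untied, warm-start it the same way.
        if out_emb is not None and not model.config.tie_word_embeddings:
            out_emb.weight[new_id] = out_emb.weight[sub_ids].mean(dim=0)
```

Averaging related embeddings gives the new rows a sensible starting point in the model's existing representation space, so continued pretraining converges faster than it would from random initialization.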
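For the alignment stage, the summary names SFT followed by DPO but not the tooling. A minimal preference-optimization pass with the trl library's DPOTrainer might look like the following sketch; the dataset path, checkpoint name, and hyperparameters are all placeholders, not Falcon-Arabic's actual configuration.

```python
# Sketch: DPO preference alignment with trl (recent versions).
# All names and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "tiiuae/Falcon3-7B-Instruct"  # assumed SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A preference dataset with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("json", data_files="arabic_prefs.jsonl", split="train")

config = DPOConfig(
    output_dir="falcon-arabic-dpo",
    beta=0.1,  # strength of the KL penalty toward the reference model
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```

When no explicit ref_model is passed, trl clones the policy as a frozen reference, which is the standard DPO setup of penalizing divergence from the SFT starting point.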
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info