mmBERT: ModernBERT Goes Multilingual with 1,833-Language Support and the Gemma 2 Tokenizer
AI Impact Summary
mmBERT pairs ModernBERT-base's 22-layer encoder architecture with the Gemma 2 tokenizer to enable massively multilingual inference across 1,833 languages. The training pipeline runs a three-phase schedule over more than 3T tokens, progressively shifting the sampling distribution away from high-resource language bias toward uniform multilingual coverage; this delivers strong gains on XTREME and XNLI along with improved performance on code and retrieval tasks. Adopting the model brings broad language coverage and better low-resource support, but requires migrating to the Gemma 2 tokenizer, with corresponding updates to tokenization pipelines, vocabulary-dependent components, and serving endpoints. Plan for deployment considerations around model size, inference latency, and variant selection during production rollout.
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info