Train an Esperanto RoBERTa-like model from scratch using Transformers and Tokenizers (EsperBERTo)
AI Impact Summary
This post outlines a concrete end-to-end workflow for training a RoBERTa-like language model from scratch with Transformers and Tokenizers, targeting Esperanto with an 84M-parameter configuration. It covers assembling a roughly 3 GB corpus (OSCAR Esperanto plus Leipzig corpora), training a ByteLevelBPETokenizer with a 52k vocabulary, and training the model with run_language_modeling.py, leaving model_name_or_path set to None so that weights are initialized from scratch. For teams, this enables building a language-specific masked language model and a downstream part-of-speech (POS) tagging capability (by fine-tuning EsperBERTo-small) without relying on English pre-trained models, though achieving quality results requires compute and careful data handling.
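As a rough illustration of the setup steps above, the sketch below trains a ByteLevelBPETokenizer with a 52k vocabulary and builds a small RoBERTa configuration to initialize from scratch. The corpus path, output directory, and exact hyperparameters (layer count, attention heads, position embeddings) are illustrative assumptions consistent with a small ~84M-parameter model, not verbatim values from the post.

```python
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig, RobertaForMaskedLM

# 1. Train a byte-level BPE tokenizer (52k vocab) on the raw text corpus.
#    "./data" and "./EsperBERTo" are hypothetical locations for this sketch.
paths = [str(p) for p in Path("./data").glob("**/*.txt")]
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("./EsperBERTo")  # writes vocab.json and merges.txt

# 2. Define a small RoBERTa-like config and instantiate the model with random
#    weights (training from scratch rather than loading a pretrained checkpoint).
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_hidden_layers=6,
    num_attention_heads=12,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config)
print(f"Parameters: {model.num_parameters():,}")
```

The randomly initialized model and the saved tokenizer files can then be passed to a masked-language-modeling training script such as run_language_modeling.py, with no pretrained checkpoint specified so training starts from scratch.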
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info