Hugging Face Hub: ML-based language metadata tagging for datasets (fastText)
AI Impact Summary
The post describes an experimental approach to inferring language metadata for Hugging Face Hub datasets: sample text from a dataset through the dataset viewer API, run it through the facebook/fasttext-language-identification model, map the predictions to ISO language codes, and propose metadata updates via librarian-bots. It targets a gap in which ~87% of datasets lack language metadata; filling that gap would improve discoverability through language-based filtering and help with model and data selection. Key considerations include handling multilingual datasets, the accuracy of language predictions, and governance of auto-suggested metadata, including PR review, to avoid incorrect labels.
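The aggregation and mapping steps described above can be sketched as follows. This is a minimal illustration, not the post's implementation: it assumes per-row top-1 predictions have already been obtained, that labels look like `__label__eng_Latn`, and it uses hypothetical names (`parse_label`, `aggregate`, `ISO_639_3_TO_1`) and hypothetical thresholds (`min_share`, `min_mean_score`). The code table is deliberately partial.

```python
from collections import defaultdict

# Hypothetical, deliberately partial ISO 639-3 -> ISO 639-1 table (illustration only).
ISO_639_3_TO_1 = {"eng": "en", "fra": "fr", "deu": "de", "spa": "es"}


def parse_label(label: str) -> str:
    """Turn a label such as '__label__eng_Latn' into the language part, 'eng'."""
    return label.replace("__label__", "", 1).split("_")[0]


def aggregate(predictions, min_share=0.2, min_mean_score=0.8):
    """Select languages that are both frequent and confidently predicted.

    predictions: iterable of (label, score) pairs, one top-1 prediction
    per sampled row. Both thresholds are assumed values, not from the post.
    """
    scores = defaultdict(list)
    for label, score in predictions:
        scores[parse_label(label)].append(score)

    total = sum(len(vals) for vals in scores.values())
    kept = []
    for lang, vals in scores.items():
        share = len(vals) / total          # fraction of sampled rows in this language
        mean_score = sum(vals) / len(vals)  # average model confidence for it
        if share >= min_share and mean_score >= min_mean_score:
            # Fall back to the three-letter code when no two-letter code is known.
            kept.append(ISO_639_3_TO_1.get(lang, lang))
    return sorted(kept)


sample = [
    ("__label__eng_Latn", 0.98),
    ("__label__eng_Latn", 0.95),
    ("__label__fra_Latn", 0.91),
    ("__label__deu_Latn", 0.40),
]
suggested = aggregate(sample)
```

Requiring both a minimum share of rows and a minimum mean score is one way to keep a multilingual dataset's secondary languages while dropping low-confidence noise; the suggested codes would still go through human review before being merged.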
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info