Databricks and Hugging Face: Dataset.from_spark enables up to 40% faster LLM training
AI Impact Summary
Databricks and Hugging Face now support loading a Spark DataFrame directly into a Hugging Face Dataset via the Dataset.from_spark API, enabling faster data preparation for LLM training and tuning. By combining Spark-based transformations with Hugging Face's memory-mapped datasets, teams can cut preprocessing time: in one 16 GB example, load time dropped from 22 minutes to about 12 minutes. This accelerates fine-tuning workflows such as those for Dolly and the databricks-dolly-15k dataset, reducing total iteration time and lowering compute cost per training run.
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info