Databricks and Hugging Face: Dataset.from_spark enables up to 40% faster LLM training
AI Impact Summary
Databricks and Hugging Face now support loading a Spark DataFrame directly into a Hugging Face Dataset via the Dataset.from_spark API, enabling faster data preparation for LLM training and tuning. By combining Spark-based transformations with Hugging Face's memory-mapped datasets, teams can cut preprocessing time: in one 16 GB example, load time dropped from 22 minutes to about 12 minutes. This accelerates fine-tuning workflows such as those for Dolly and the databricks-dolly-15k dataset, reducing total iteration time and lowering compute cost per training run.
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info