
HuggingFace datasets streaming upgrade: 100x fewer startup requests and 2x throughput via load_dataset(streaming=True)

AI Impact Summary

Streaming improvements in the datasets library let load_dataset(..., streaming=True) read data from the Hub without a full download. Startup request storms drop by about 100x because the resolved file list is cached and shared across DataLoader workers and the initial API calls are bundled. Parquet prefetching and configurable buffering keep the GPU pipeline fed, delivering up to 2x faster streaming and making it practical to train on multi-terabyte datasets directly from the Hugging Face Hub. The change touches the datasets library, the load_dataset API, ParquetFragmentScanOptions, HfFileSystem in huggingface_hub, and Xet-based deduplicated storage, and reflects a shift toward zero-download data access for large-scale ML workflows.
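As a rough illustration of the streaming path described above, the sketch below wraps load_dataset(..., streaming=True) in a small helper. The helper name stream_first_examples and the repo id used in the usage note are hypothetical; only load_dataset, its streaming flag, and the take method on the returned iterable dataset come from the datasets library.

```python
from typing import Any, Dict, List


def stream_first_examples(
    repo_id: str, split: str = "train", n: int = 5
) -> List[Dict[str, Any]]:
    """Return the first n examples of a Hub dataset without a full download.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    # Imported inside the function so this sketch can be loaded even on a
    # machine where the datasets package is not installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that reads from the Hub
    # lazily instead of downloading every file up front.
    stream = load_dataset(repo_id, split=split, streaming=True)

    # take(n) limits the stream to n examples; data is only fetched once
    # iteration starts, which list() triggers here.
    return list(stream.take(n))
```

With network access, a call such as stream_first_examples("user/some-dataset", n=3) would return three example dictionaries, where "user/some-dataset" is a placeholder repo id. The prefetching and buffering options mentioned in the summary are configured separately and are not shown here.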

Affected Systems

datasets library
load_dataset API

Date: not specified
Change type: capability
Severity: info
