Datasets streaming 100x more efficient: caching, prefetching, and startup optimization in datasets library

AI Impact Summary

The Hugging Face datasets library has rolled out a major streaming upgrade that reduces startup request storms by caching the data file list across DataLoader workers and consolidating file-resolution calls. In practice, this yields up to 100x fewer startup requests, 10x faster data file resolution, and up to 2x higher streaming throughput, and it remains stable at 256 concurrent workers. Parquet prefetching and configurable buffering further improve GPU utilization by keeping the data pipeline fed. The change is backwards compatible, so existing code that uses streaming=True continues to work after upgrading datasets and huggingface_hub. Enterprises training on multi-terabyte datasets can now start training with less pre-download overhead and fewer storage bottlenecks.
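To make the startup saving concrete, here is a minimal sketch of the general idea of resolving the data file list once and sharing it with every worker, instead of letting each worker resolve it separately. It is not the datasets library's implementation: FakeHub, list_files, shard_for_worker, start_workers_naive, and start_workers_cached are hypothetical names invented for illustration, and the workers are simulated with a plain loop rather than real DataLoader processes.

```python
class FakeHub:
    """Hypothetical stand-in for a remote storage API (not a real library class).

    It only counts how many file-listing requests it receives.
    """

    def __init__(self, files):
        self._files = list(files)
        self.list_requests = 0

    def list_files(self):
        # Each call models one network round trip to resolve the data files.
        self.list_requests += 1
        return list(self._files)


def shard_for_worker(files, worker_id, num_workers):
    # Round-robin assignment of files to workers (an assumed sharding policy).
    return files[worker_id::num_workers]


def start_workers_naive(hub, num_workers):
    # Every simulated worker resolves the file list itself: one request each.
    return [
        shard_for_worker(hub.list_files(), worker_id, num_workers)
        for worker_id in range(num_workers)
    ]


def start_workers_cached(hub, num_workers):
    # The file list is resolved once, then the cached copy is shared.
    files = hub.list_files()
    return [
        shard_for_worker(files, worker_id, num_workers)
        for worker_id in range(num_workers)
    ]
```

In this sketch the naive path issues one listing request per worker, while the cached path issues a single request regardless of worker count, and both produce the same shards. The real gains reported above also depend on consolidated file-resolution calls and prefetching, which this sketch does not model.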

Affected Systems

datasets library, load_dataset API

Date: not specified
Change type: capability
Severity: info
