Datasets streaming 100x more efficient: caching, prefetching, and startup optimization in the datasets library
AI Impact Summary
The Hugging Face datasets library has rolled out a major streaming upgrade that reduces startup request storms by caching the data file list across DataLoader workers and consolidating file-resolution calls. In practice, this yields up to 100x fewer startup requests, 10x faster data file resolution, and up to 2x higher streaming throughput, and streaming stays stable at 256 concurrent workers. Parquet prefetching and configurable buffering further improve GPU utilization by keeping the data pipeline fed. The change is backwards compatible: existing code that uses streaming=True continues to work after upgrading datasets and huggingface_hub. Enterprises training on multi-terabyte datasets can now start training with less pre-download overhead and fewer storage bottlenecks.
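To see why caching the data file list cuts startup requests, the following is a minimal, self-contained sketch. It is not the datasets library's implementation: `FakeHub`, `start_workers_naive`, and `start_workers_cached` are hypothetical names invented for this illustration, and the "hub" is an in-memory stand-in that only counts listing calls.

```python
from dataclasses import dataclass


@dataclass
class FakeHub:
    """Hypothetical stand-in for a remote repository; counts listing requests."""
    files: list
    list_requests: int = 0

    def list_files(self):
        # Each call models one network round trip to resolve the data files.
        self.list_requests += 1
        return list(self.files)


def start_workers_naive(hub, num_workers):
    """Every worker resolves the file list itself: one request per worker."""
    return [hub.list_files()[i::num_workers] for i in range(num_workers)]


def start_workers_cached(hub, num_workers):
    """The file list is resolved once and the cached copy is shared by all workers."""
    cached = hub.list_files()
    return [cached[i::num_workers] for i in range(num_workers)]


if __name__ == "__main__":
    names = [f"shard-{i:05d}.parquet" for i in range(512)]  # made-up file names

    naive_hub = FakeHub(names)
    start_workers_naive(naive_hub, num_workers=256)

    cached_hub = FakeHub(names)
    start_workers_cached(cached_hub, num_workers=256)

    print("naive listing requests: ", naive_hub.list_requests)
    print("cached listing requests:", cached_hub.list_requests)
```

In this toy model, the per-worker approach issues one listing request per worker, while the cached approach issues a single request regardless of the worker count. Both produce the same file assignment per worker; only the number of resolution calls differs.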
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info