Hugging Face Hub: Improving Parquet dedupe via content-defined row groups and CDC
AI Impact Summary
The Xet team at Hugging Face is evaluating ways to improve Parquet deduplication on the Hugging Face Hub, aiming to reduce storage growth and speed up repeated dataset updates. Parquet’s row-group layout causes edits to rewrite column headers, which limits dedupe effectiveness even when the data blocks are largely unchanged. The proposals include content-defined row groups, relative offsets for file structures, and closer collaboration with Apache Arrow to implement the changes in Parquet/Arrow code. Successful adoption would lower the incremental storage needed for new dataset versions and speed up update workflows for users.
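To illustrate why content-defined boundaries help deduplication, here is a minimal byte-level sketch of content-defined chunking (CDC) using a gear-style rolling hash. It is not the Xet or Parquet/Arrow implementation. The function names, the chunk-size parameters, and the hash table are all hypothetical choices made for this example. It compares how many chunks survive a small insertion under CDC versus fixed-size chunking.

```python
import hashlib
import random
from typing import List

# Hypothetical gear table: one pseudo-random 32-bit value per byte value.
# A fixed seed keeps the example deterministic.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]

MASK = 0xFC000000  # top 6 bits; a cut is expected roughly every 64 bytes


def cdc_chunks(data: bytes, min_size: int = 32, max_size: int = 1024) -> List[bytes]:
    """Split data where the rolling hash of the last 32 bytes matches MASK."""
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # After 32 shifts an older byte no longer affects the 32-bit hash,
        # so the cut decision depends on content, not on absolute position.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 64) -> List[bytes]:
    """Split data at fixed offsets, analogous to fixed-size row groups."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_fraction(old: List[bytes], new: List[bytes]) -> float:
    """Fraction of new chunks whose hash already exists among the old chunks."""
    seen = {hashlib.sha256(c).digest() for c in old}
    hits = sum(1 for c in new if hashlib.sha256(c).digest() in seen)
    return hits / len(new)


# Simulate a small edit near the start of a 20,000-byte file.
data_rng = random.Random(1)
original = bytes(data_rng.getrandbits(8) for _ in range(20000))
edited = original[:100] + b"inserted row" + original[100:]

cdc_shared = shared_fraction(cdc_chunks(original), cdc_chunks(edited))
fixed_shared = shared_fraction(fixed_chunks(original), fixed_chunks(edited))
print(f"CDC chunks reused:   {cdc_shared:.1%}")
print(f"Fixed chunks reused: {fixed_shared:.1%}")
```

With fixed-size chunks, the insertion shifts every later boundary, so almost no chunk matches a stored one. With CDC, the boundaries realign shortly after the edit and most chunks can be reused. Content-defined row groups would apply the same idea at the level of rows rather than raw bytes.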
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info