Hugging Face Hub improves Parquet dedupe with content-defined row groups and relative offsets
AI Impact Summary
Hugging Face's Xet team is working on storage efficiency for Parquet files on the Hub, with the goal of keeping multiple versions of a dataset compact as storage grows (11PB in total, of which 2.2PB is Parquet). Their experiments show that append-heavy updates deduplicate well, but in-place row modifications and deletions trigger widespread rewrites of Parquet headers and row-group layouts, which reduces the deduplication gains. They propose content-defined row groups and relative offsets, and may collaborate with the Apache Arrow project to implement them. Adopting these ideas would require changes to Parquet writers and ingestion workflows.
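To make the row-group point concrete, the toy sketch below compares fixed-size row groups with content-defined ones after a single row deletion. It is an illustration only, not the Xet team's implementation or any Parquet writer's API: the helper names (`row_hash`, `fixed_groups`, `content_defined_groups`, `shared_fraction`), the SHA-256 based boundary rule, and the divisor of 8 are all assumptions chosen for clarity. Rows are plain strings and a "group" is just a tuple of rows.

```python
import hashlib


def row_hash(row: str) -> int:
    # Stable hash of a row's content (hypothetical boundary signal).
    return int.from_bytes(hashlib.sha256(row.encode()).digest()[:8], "big")


def fixed_groups(rows, size=8):
    # Fixed-size row groups: boundaries depend on row position.
    return [tuple(rows[i:i + size]) for i in range(0, len(rows), size)]


def content_defined_groups(rows, divisor=8):
    # Content-defined row groups: a row closes the current group when its
    # hash is divisible by `divisor`, so boundaries depend on row content.
    groups, current = [], []
    for row in rows:
        current.append(row)
        if row_hash(row) % divisor == 0:
            groups.append(tuple(current))
            current = []
    if current:
        groups.append(tuple(current))
    return groups


def shared_fraction(old, new):
    # Fraction of groups in the new version that already exist in the old one.
    old_set = set(old)
    return sum(g in old_set for g in new) / len(new)


rows = [f"row-{i}" for i in range(1000)]
edited = rows[:100] + rows[101:]  # delete a single row near the start

fixed_reuse = shared_fraction(fixed_groups(rows), fixed_groups(edited))
cdc_reuse = shared_fraction(
    content_defined_groups(rows), content_defined_groups(edited)
)

print(f"fixed-size groups reused:      {fixed_reuse:.1%}")
print(f"content-defined groups reused: {cdc_reuse:.1%}")
```

With fixed-size groups, every group after the deleted row shifts by one row and no longer matches its earlier version. With content-defined boundaries, only the group that contained the deleted row changes, so nearly all groups remain identical and can be deduplicated. Real Parquet files add a further complication that this sketch ignores: metadata that stores absolute file offsets changes whenever earlier bytes move, which is the motivation for the relative-offset proposal.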
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info