Docmatix DocVQA dataset released: 2.4M images, 9.5M Q/A; 20% improvement for Florence-2 fine-tuning
AI Impact Summary
Docmatix is a DocVQA dataset 240x larger than prior corpora: 2.4 million images and 9.5 million Q/A pairs derived from 1.3 million PDFs. The Q/A pairs were generated from the PDFA OCR corpus with a Phi-3-small model and then filtered to remove hallucinations. The pipeline converts PDFs to 150 dpi images and publishes the assets on the Hugging Face Hub to enable reproducible fine-tuning workflows. In ablations, fine-tuning Florence-2 on Docmatix yields roughly a 20% relative uplift on DocVQA tasks, and a 0.7B Florence-2 variant trained on this data approaches the performance of an 8B Idefics2 model. The release narrows the gap between open-source and closed-source VLMs for DocVQA and gives the community a scalable resource, though the large image assets raise substantial storage and provenance considerations.
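As a rough sanity check on the figures above, the sketch below derives the average number of Q/A pairs per image and images per PDF, and shows how a "relative uplift" is computed. The dataset counts come from the summary. The function name relative_uplift and the two benchmark scores are hypothetical, chosen only to illustrate what a 20% relative gain means; they are not reported results.

```python
def relative_uplift(baseline: float, improved: float) -> float:
    """Return the improvement of `improved` over `baseline` as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Counts stated in the summary.
NUM_IMAGES = 2_400_000
NUM_QA_PAIRS = 9_500_000
NUM_PDFS = 1_300_000

# Derived averages (approximate, since the stated counts are rounded).
qa_per_image = NUM_QA_PAIRS / NUM_IMAGES
images_per_pdf = NUM_IMAGES / NUM_PDFS

# Hypothetical scores, for illustration only: NOT figures from the release.
baseline_score = 50.0
finetuned_score = 60.0
uplift = relative_uplift(baseline_score, finetuned_score)

print(f"Q/A pairs per image: {qa_per_image:.2f}")
print(f"Images per PDF:      {images_per_pdf:.2f}")
print(f"Relative uplift:     {uplift:.0%}")
```

A relative uplift is measured against the baseline score, so a 20% relative gain is not the same as 20 absolute points unless the baseline happens to be 100.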
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info