Microsoft Florence-2 fine-tuned on DocVQA achieves 57.0% similarity
AI Impact Summary
Microsoft's Florence-2 is a compact vision-language model available in 0.2B and 0.7B parameter sizes and pre-trained on the FLD-5B dataset. The released checkpoints lacked VQA capability, but fine-tuning on DocVQA, combined with region-to-description prompting, raised the similarity score from 0% to 57.0% after seven epochs. This shows that Florence-2 can be adapted to specific downstream tasks through targeted fine-tuning, though further fine-tuning on a dataset such as The Cauldron is recommended for optimal VQA performance.
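The summary does not state how the similarity score is computed. As a minimal sketch, assuming it is a normalized Levenshtein (edit-distance) similarity averaged over prediction and ground-truth answer pairs, a metric of this kind could look like the following. The function names here are hypothetical and are not taken from the Florence-2 code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(prediction: str, truth: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(prediction), len(truth))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(prediction, truth) / longest


def average_similarity(predictions: list[str], truths: list[str]) -> float:
    """Mean similarity over all answer pairs, expressed as a percentage."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not predictions:
        return 0.0
    total = sum(similarity(p, t) for p, t in zip(predictions, truths))
    return 100.0 * total / len(predictions)
```

Under this assumed definition, a model whose answers share no characters in position with the ground truth scores 0%, and a score of 57.0% means that predicted answers are, on average, a little over halfway to an exact character-level match.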
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info