LAVE: LLM-Assisted VQA Evaluation on Docmatix — rigid metrics hinder zero-shot performance
AI Impact Summary
The LAVE evaluation highlights a critical issue in VQA: rigid string-matching metrics such as CIDEr, ANLS, and BLEU are overly restrictive when scoring zero-shot performance on synthetic datasets like Docmatix. The study demonstrates that an LLM judge can correctly credit answers that are semantically right but deviate from the strict format of the reference answer, suggesting a shift away from purely lexical evaluation. This finding has significant implications for the development and deployment of VQA models, particularly those trained on synthetic data.
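To illustrate why lexical metrics penalize format deviations, here is a minimal sketch of ANLS (normalized Levenshtein similarity with the standard 0.5 rejection threshold, as used in document VQA benchmarks); the example answers are hypothetical, not drawn from Docmatix.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(reference: str, prediction: str, tau: float = 0.5) -> float:
    """ANLS score: 1 - normalized edit distance, zeroed past the threshold."""
    a, b = reference.strip().lower(), prediction.strip().lower()
    if not a and not b:
        return 1.0
    nl = levenshtein(a, b) / max(len(a), len(b))
    return 1.0 - nl if nl < tau else 0.0

# An exact string match scores 1.0 ...
print(anls("$45.00", "$45.00"))      # 1.0
# ... but a semantically equivalent rephrasing scores 0.0,
# which is exactly the rigidity an LLM judge avoids.
print(anls("$45.00", "45 dollars"))  # 0.0
```

The second call shows the failure mode: "45 dollars" is a correct answer a human (or an LLM judge) would accept, yet ANLS rejects it outright because the edit distance exceeds the threshold.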
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info