LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs — rethinking fine-tuning for DocVQA
AI Impact Summary
Zero-shot VQA evaluation on Docmatix with LLMs is explored to determine whether fine-tuning is still beneficial. The authors observed that fine-tuning Florence-2 on Docmatix improved its document-VQA capability but degraded its scores on the DocVQA benchmark, necessitating further DocVQA-specific fine-tuning to match the benchmark; even so, human evaluators preferred the simpler Docmatix-only model for broader use. They introduce LAVE (LLM-Assisted VQA Evaluation) as an alternative metric intended to align better with human judgment, suggesting that traditional metrics (CIDEr, ANLS, BLEU) may be overly restrictive for synthetic, OOD-like datasets. The experiment uses MPLUG-DocOwl1.5 as the baseline and Llama-2-Chat-7B as the rater, and highlights that a semantic match does not guarantee a high metric score, raising the question of how to evaluate VQA systems in synthetic, out-of-distribution settings.
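To make the last point concrete, the sketch below contrasts the two scoring styles: a self-contained ANLS implementation that zeroes out a semantically correct paraphrase, and a LAVE-style scorer that asks an LLM judge to rate an answer on the 1-to-3 scale described in the LAVE paper, rescaled to [0, 1]. The prompt wording and the helper names (`anls`, `lave_score`, `generate`) are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative contrast between a strict string metric (ANLS) and a
# LAVE-style LLM rating. All names here are hypothetical sketches.

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, references: list[str], tau: float = 0.5) -> float:
    """ANLS: best normalized Levenshtein similarity, zeroed below tau."""
    best = 0.0
    for ref in references:
        p, r = prediction.strip().lower(), ref.strip().lower()
        sim = 1.0 - levenshtein(p, r) / max(len(p), len(r), 1)
        best = max(best, sim)
    return best if best >= tau else 0.0

# A semantically correct rephrasing still scores 0.0 under ANLS:
print(anls("the total is 12 dollars", ["$12"]))

LAVE_PROMPT = """You are evaluating an answer to a visual question.
Question: {question}
Reference answers: {references}
Candidate answer: {candidate}
Rate the candidate from 1 (incorrect) to 3 (correct), judging meaning
rather than exact wording. Reply with the rating only."""

def lave_score(question: str, references: list[str], candidate: str,
               generate) -> float:
    """LAVE-style score: an LLM rating in {1, 2, 3} rescaled to [0, 1].

    `generate` is any callable mapping a prompt string to the judge
    model's text reply (e.g. a Llama-2-Chat-7B generation pipeline).
    """
    reply = generate(LAVE_PROMPT.format(
        question=question,
        references=", ".join(references),
        candidate=candidate))
    rating = int(next(ch for ch in reply if ch in "123"))  # first rating digit
    return (rating - 1) / 2  # 1 -> 0.0, 2 -> 0.5, 3 -> 1.0
```

With a judge that accepts paraphrases, the "$12" example above can receive full credit, which is exactly the gap between string metrics and human judgment that the summary describes.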
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info