LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs — rethinking fine-tuning for DocVQA
AI Impact Summary
Zero-shot VQA evaluation on Docmatix with LLMs is explored to determine whether fine-tuning is still beneficial. The authors observed that fine-tuning Florence-2 on Docmatix improved its document-VQA capability but degraded its scores on the DocVQA benchmark, necessitating further DocVQA-specific fine-tuning to match the benchmark; even so, human evaluators preferred the simpler Docmatix-only model for broader use. They introduce LAVE (LLM-Assisted VQA Evaluation) as an alternative metric intended to align better with human judgment, suggesting that traditional metrics (CIDEr, ANLS, BLEU) may be overly restrictive for synthetic, OOD-like datasets. The experiment uses MPLUG-DocOwl1.5 as the baseline and Llama-2-Chat-7B as the rater, and highlights that a semantic match does not guarantee a high metric score, raising the question of how to evaluate VQA systems in synthetic, out-of-distribution settings.
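To make the last point concrete, the sketch below contrasts the two scoring styles: a self-contained ANLS implementation that zeroes out a semantically correct paraphrase, and a LAVE-style scorer that asks an LLM judge to rate an answer on the 1-to-3 scale described in the LAVE paper, rescaled to [0, 1]. The prompt wording and the helper names (`anls`, `lave_score`, `generate`) are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative contrast between a strict string metric (ANLS) and a
# LAVE-style LLM rating. All names here are hypothetical sketches.

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, references: list[str], tau: float = 0.5) -> float:
    """ANLS: best normalized Levenshtein similarity, zeroed below tau."""
    best = 0.0
    for ref in references:
        p, r = prediction.strip().lower(), ref.strip().lower()
        sim = 1.0 - levenshtein(p, r) / max(len(p), len(r), 1)
        best = max(best, sim)
    return best if best >= tau else 0.0

# A semantically correct rephrasing still scores 0.0 under ANLS:
print(anls("the total is 12 dollars", ["$12"]))

LAVE_PROMPT = """You are evaluating an answer to a visual question.
Question: {question}
Reference answers: {references}
Candidate answer: {candidate}
Rate the candidate from 1 (incorrect) to 3 (correct), judging meaning
rather than exact wording. Reply with the rating only."""

def lave_score(question: str, references: list[str], candidate: str,
               generate) -> float:
    """LAVE-style score: an LLM rating in {1, 2, 3} rescaled to [0, 1].

    `generate` is any callable mapping a prompt string to the judge
    model's text reply (e.g. a Llama-2-Chat-7B generation pipeline).
    """
    reply = generate(LAVE_PROMPT.format(
        question=question,
        references=", ".join(references),
        candidate=candidate))
    rating = int(next(ch for ch in reply if ch in "123"))  # first rating digit
    return (rating - 1) / 2  # 1 -> 0.0, 2 -> 0.5, 3 -> 1.0
```

With a judge that accepts paraphrases, the "$12" example above can receive full credit, which is exactly the gap between string metrics and human judgment that the summary describes.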
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info