TimeScope benchmark assesses long-video understanding for vision-language models
AI Impact Summary
TimeScope is an open-source benchmark that inserts short "needle" clips into base videos ranging from 1 minute to 8 hours to evaluate localized retrieval, information synthesis, and fine-grained temporal perception in vision-language models. It shows that many leading models struggle with true temporal comprehension and that simply scaling parameters does not extend the useful context horizon much beyond short clips. Gemini 2.5 Pro stands out by maintaining accuracy on videos longer than one hour, while other models such as Qwen2.5-VL and InternVL variants show task-specific strengths and weaknesses. The public Hugging Face Space and accompanying lmms_eval tooling should accelerate community benchmarking and highlight where training and data need to emphasize long-form temporal reasoning.
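The core needle-in-a-haystack construction can be sketched in a few lines. This is a minimal illustration, not TimeScope's actual implementation: frames are stand-in objects rather than decoded video, and the function name and signature are hypothetical.

```python
import random

def insert_needle(base_frames, needle_frames, seed=None):
    """Splice a short 'needle' clip into a longer base video.

    Returns the combined frame list and the (start, end) span where
    the needle was placed -- the region the evaluated model must
    localize or reason over.
    """
    rng = random.Random(seed)
    # Pick a random insertion point anywhere in the base video.
    start = rng.randint(0, len(base_frames))
    combined = base_frames[:start] + needle_frames + base_frames[start:]
    return combined, (start, start + len(needle_frames))

# Example: an 8-frame "haystack" and a 2-frame needle.
video, span = insert_needle(list("BBBBBBBB"), list("NN"), seed=0)
```

Longer base videos stress the model's context horizon while the needle length stays fixed, which is how the benchmark separates retrieval ability from raw context size.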
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info