TimeScope benchmark reveals plateau in long-video understanding for leading vision-language models
AI Impact Summary
TimeScope is an open-source benchmark on Hugging Face that inserts short video "needles" (5–10 seconds) into base videos ranging from 1 minute to 8 hours, evaluating localized retrieval, information synthesis, and fine-grained temporal perception. The study shows that most leading vision-language models plateau on long-video tasks: Gemini 2.5-Pro maintains accuracy on videos longer than one hour, while other models (e.g., Qwen 2.5-VL variants, InternVL 2.5) exhibit accuracy curves similar to those of their smaller counterparts. This implies that real-world long-duration video understanding remains a bottleneck for applications requiring extended narrative reasoning or precise motion analysis, even as model sizes grow. A sketch of the needle-insertion setup follows.
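The sketch below illustrates how a needle-in-a-haystack video probe of the kind described above could be assembled: a 5–10 second needle placed at a sampled offset inside a base video of 1 minute to 8 hours. The `NeedleProbe` record, the `make_probe` helper, and the uniform-offset sampling are illustrative assumptions, not TimeScope's actual construction code.

```python
import random
from dataclasses import dataclass


@dataclass
class NeedleProbe:
    """One evaluation item: a short needle clip placed inside a long base video."""
    base_duration_s: float    # length of the haystack video (60 s to 8 h)
    needle_duration_s: float  # length of the inserted clip (5-10 s)
    insert_at_s: float        # timestamp where the needle begins
    question: str             # retrieval question about the needle's content


def make_probe(base_duration_s: float, question: str,
               rng: random.Random) -> NeedleProbe:
    """Place a 5-10 s needle at a sampled offset inside the base video.

    The uniform-offset choice here is an assumption for illustration,
    not TimeScope's documented sampling strategy.
    """
    needle_s = rng.uniform(5.0, 10.0)
    insert_at = rng.uniform(0.0, base_duration_s - needle_s)
    return NeedleProbe(base_duration_s, needle_s, insert_at, question)


if __name__ == "__main__":
    rng = random.Random(0)
    # Base durations spanning the benchmark's stated range: 1 minute to 8 hours.
    for base in (60.0, 3600.0, 8 * 3600.0):
        probe = make_probe(base, "What object does the chef pick up?", rng)
        print(f"base={probe.base_duration_s:>8.0f}s  "
              f"needle={probe.needle_duration_s:4.1f}s  "
              f"at t={probe.insert_at_s:8.1f}s")
```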
Affected Systems
- Date: Not specified
- Change type: Capability
- Severity: Info