TimeScope benchmark reveals plateau in long-video understanding for leading vision-language models
AI Impact Summary
TimeScope is an open-source benchmark on Hugging Face that inserts short video "needles" (5–10 seconds) into base videos ranging from 1 minute to 8 hours, evaluating localized retrieval, information synthesis, and fine-grained temporal perception. The study shows that most leading vision-language models plateau on long-video tasks: Gemini 2.5-Pro maintains accuracy on videos longer than one hour, while other models (e.g., Qwen 2.5-VL variants, InternVL 2.5) exhibit accuracy curves similar to those of their smaller counterparts. This implies that real-world long-duration video understanding remains a bottleneck for applications requiring extended narrative reasoning or precise motion analysis, even as model sizes grow. A sketch of the needle-insertion setup follows.
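The sketch below illustrates how a needle-in-a-haystack video probe of the kind described above could be assembled: a 5–10 second needle placed at a sampled offset inside a base video of 1 minute to 8 hours. The `NeedleProbe` record, the `make_probe` helper, and the uniform-offset sampling are illustrative assumptions, not TimeScope's actual construction code.

```python
import random
from dataclasses import dataclass


@dataclass
class NeedleProbe:
    """One evaluation item: a short needle clip placed inside a long base video."""
    base_duration_s: float    # length of the haystack video (60 s to 8 h)
    needle_duration_s: float  # length of the inserted clip (5-10 s)
    insert_at_s: float        # timestamp where the needle begins
    question: str             # retrieval question about the needle's content


def make_probe(base_duration_s: float, question: str,
               rng: random.Random) -> NeedleProbe:
    """Place a 5-10 s needle at a sampled offset inside the base video.

    The uniform-offset choice here is an assumption for illustration,
    not TimeScope's documented sampling strategy.
    """
    needle_s = rng.uniform(5.0, 10.0)
    insert_at = rng.uniform(0.0, base_duration_s - needle_s)
    return NeedleProbe(base_duration_s, needle_s, insert_at, question)


if __name__ == "__main__":
    rng = random.Random(0)
    # Base durations spanning the benchmark's stated range: 1 minute to 8 hours.
    for base in (60.0, 3600.0, 8 * 3600.0):
        probe = make_probe(base, "What object does the chef pick up?", rng)
        print(f"base={probe.base_duration_s:>8.0f}s  "
              f"needle={probe.needle_duration_s:4.1f}s  "
              f"at t={probe.insert_at_s:8.1f}s")
```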
Affected Systems
- Date: Not specified
- Change type: Capability
- Severity: Info