
ScreenSuite GUI Agent Benchmarking Suite launches 13-task, vision-only evaluation across desktop and mobile environments

AI Impact Summary

ScreenSuite is a modular benchmark suite for GUI agents that evaluates them from screenshots alone, across 13 tasks and several operating systems, using Docker and desktop/mobile sandboxes. It uses smolagents for orchestration and benchmarks models including Qwen2.5-VL-72B, UI-Tars-1.5-7B, Holo1-7B, and GPT-4o, measuring how accurately they localize interface elements and place clicks. Because agents receive no accessibility tree or DOM data, the evaluation is harder and closer to how a human uses a screen; it can expose gaps in current GUI-agent capabilities and inform model selection and optimization for automation across Windows, Ubuntu Desktop, and Android.
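To make the idea of click precision concrete, here is a minimal sketch of how such a score could be computed. It assumes a click counts as correct when it lands inside the target element's bounding box, which is a common convention in GUI grounding benchmarks and not necessarily ScreenSuite's own scoring code. The names `Box` and `click_accuracy` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a target UI element, in pixels (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A point on the border is treated as inside the box.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def click_accuracy(predictions, targets):
    """Fraction of predicted (x, y) clicks that fall inside their target box.

    Assumption: one predicted click and one target box per example.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


if __name__ == "__main__":
    button = Box(left=0, top=0, right=20, bottom=20)
    # First click lands on the button, second misses it.
    print(click_accuracy([(10, 10), (50, 50)], [button, button]))
```

A vision-only agent would produce the predicted coordinates from the screenshot alone; the scorer needs only those coordinates and the ground-truth boxes, so it never exposes DOM or accessibility data to the model.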

Affected Systems

Qwen2.5-VL-72B, UI-Tars-1.5-7B

Date: not specified
Change type: capability
Severity: info

