ScreenSuite GUI Agent Benchmarking Suite launches 13-task, vision-only evaluation across desktop and mobile environments
AI Impact Summary
ScreenSuite is a modular benchmark suite for GUI agents. It evaluates agents on vision input only, across 13 tasks, with multi-OS environments provided through Docker and desktop/mobile sandboxes. It uses smolagents for orchestration and benchmarks models including Qwen2.5-VL-72B, UI-TARS-1.5-7B, Holo1-7B, and GPT-4o, which makes it possible to measure element localization and click precision in GUI tasks. Because it does not give agents accessibility trees or DOM data, the evaluation is harder and closer to how a human uses a screen. The results can reveal gaps in current GUI-agent capabilities and inform model selection and optimization for desktop automation on Windows, Ubuntu Desktop, and Android.
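As an illustration of what a click-precision measurement can look like, the sketch below scores predicted click coordinates against target bounding boxes. It is a minimal example under stated assumptions, not ScreenSuite's actual scoring code: the `BoundingBox` type, the `click_accuracy` function, and the "click counts as a hit if it lands inside the target box" rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned target region in pixel coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A click on the box edge is treated as a hit (an assumption).
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def click_accuracy(
    clicks: Sequence[Tuple[float, float]],
    targets: Sequence[BoundingBox],
) -> float:
    """Return the fraction of predicted clicks that land inside their target box."""
    if len(clicks) != len(targets):
        raise ValueError("clicks and targets must have the same length")
    if not clicks:
        return 0.0
    hits = sum(target.contains(x, y) for (x, y), target in zip(clicks, targets))
    return hits / len(clicks)


if __name__ == "__main__":
    # Two illustrative targets; the first click hits, the second misses.
    targets = [BoundingBox(0, 0, 10, 10), BoundingBox(20, 20, 30, 30)]
    clicks = [(5, 5), (40, 40)]
    print(click_accuracy(clicks, targets))
```

A vision-only agent would produce the `clicks` from a screenshot alone, without element coordinates from an accessibility tree or the DOM, which is what makes this kind of localization score informative.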
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info