Smolagents adds vision support with vision-language models (VLMs) for autonomous agent workflows
AI Impact Summary
Smolagents now includes vision support, enabling vision-language models (VLMs) to operate inside agentic pipelines and interpret visual content beyond text. Images can be passed at agent startup or added dynamically via step callbacks; they are stored in task_images and step_log.observation_images, which enables image-driven actions during tasks such as web navigation. The feature builds on the existing ReAct-inspired MultiStepAgent framework, the CodeAgent class, and model classes such as TransformersModel and OpenAIServerModel, and points to specific VLMs such as HuggingFaceTB/SmolVLM-Instruct. Expect added compute and latency costs, and configure image input paths carefully in deployment (including setting flatten_messages_as_text appropriately for VLMs).
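A minimal sketch of the two image-input paths described above, assuming a recent smolagents release. Exact call signatures, the step-callback signature, and the observation_images attribute name may vary by version; the screenshot file path and the capture_browser_screenshot helper are illustrative placeholders, not part of the library.

```python
from PIL import Image

from smolagents import CodeAgent, TransformersModel

# Load a VLM; flatten_messages_as_text=False keeps image content in the
# chat messages instead of flattening everything to plain text.
model = TransformersModel(
    model_id="HuggingFaceTB/SmolVLM-Instruct",
    flatten_messages_as_text=False,
)

# Path 1: pass images at startup, alongside the task prompt.
agent = CodeAgent(tools=[], model=model)
result = agent.run(
    "Describe what is shown in the attached screenshot.",
    images=[Image.open("screenshot.png")],  # illustrative path
)


def capture_browser_screenshot():
    # Hypothetical placeholder: in a real setup this would grab a live
    # browser screenshot (e.g. via Selenium or Playwright).
    return Image.new("RGB", (1024, 768))


# Path 2: attach images dynamically via a step callback, e.g. capturing
# a fresh screenshot after each step during web navigation.
def attach_screenshot(step_log, agent):
    step_log.observation_images = [capture_browser_screenshot()]


agent = CodeAgent(tools=[], model=model, step_callbacks=[attach_screenshot])
```

The callback path is what makes actions like web navigation image-driven: each step's observations can carry a fresh screenshot for the VLM to reason over.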
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info