Smolagents adds vision support with vision-language models (VLMs) for autonomous agent workflows
AI Impact Summary
Smolagents now includes vision support, enabling vision-language models (VLMs) to operate inside agentic pipelines and interpret visual content beyond text. Images can be passed at agent startup or added dynamically via step callbacks; they are stored in task_images and step_log.observation_images, which enables image-driven actions during tasks such as web navigation. The feature builds on the existing ReAct-inspired MultiStepAgent framework, the CodeAgent class, and model classes such as TransformersModel and OpenAIServerModel, and points to specific VLMs such as HuggingFaceTB/SmolVLM-Instruct. Expect added compute and latency costs, and configure image input paths carefully in deployment (including setting flatten_messages_as_text appropriately for VLMs).
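A minimal sketch of the two image-input paths described above, assuming a recent smolagents release. Exact call signatures, the step-callback signature, and the observation_images attribute name may vary by version; the screenshot file path and the capture_browser_screenshot helper are illustrative placeholders, not part of the library.

```python
from PIL import Image

from smolagents import CodeAgent, TransformersModel

# Load a VLM; flatten_messages_as_text=False keeps image content in the
# chat messages instead of flattening everything to plain text.
model = TransformersModel(
    model_id="HuggingFaceTB/SmolVLM-Instruct",
    flatten_messages_as_text=False,
)

# Path 1: pass images at startup, alongside the task prompt.
agent = CodeAgent(tools=[], model=model)
result = agent.run(
    "Describe what is shown in the attached screenshot.",
    images=[Image.open("screenshot.png")],  # illustrative path
)


def capture_browser_screenshot():
    # Hypothetical placeholder: in a real setup this would grab a live
    # browser screenshot (e.g. via Selenium or Playwright).
    return Image.new("RGB", (1024, 768))


# Path 2: attach images dynamically via a step callback, e.g. capturing
# a fresh screenshot after each step during web navigation.
def attach_screenshot(step_log, agent):
    step_log.observation_images = [capture_browser_screenshot()]


agent = CodeAgent(tools=[], model=model, step_callbacks=[attach_screenshot])
```

The callback path is what makes actions like web navigation image-driven: each step's observations can carry a fresh screenshot for the VLM to reason over.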
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info