SmolVLA: Efficient Vision-Language-Action Model Trained on LeRobot Community Data
AI Impact Summary
SmolVLA is an open-source, compact vision-language-action (VLA) model trained on community-contributed LeRobot datasets, offering a significant opportunity for robotics research. Its key features, a 450M-parameter size, asynchronous inference, and training on affordable hardware, democratize access to VLAs and accelerate research toward generalist robotic agents. The architecture pairs a SmolVLM2 vision-language model with a flow-matching transformer action expert, a design aimed at efficient and robust action prediction.
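To make the flow-matching idea concrete, the sketch below shows the inference-time mechanics: starting from noise, an action sample is produced by Euler-integrating a velocity field from t=0 to t=1. This is a minimal illustrative sketch, not SmolVLA's implementation; the closed-form `toy_velocity` field and the 3-DoF `target` action are hypothetical stand-ins for the learned action-expert transformer, which in the real model is conditioned on SmolVLM2 features.

```python
import numpy as np

def euler_flow_integrate(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0.astype(float).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy stand-in for the learned velocity field: it pushes the sample
# toward `target` so that x reaches the target exactly at t=1.
# In SmolVLA this role is played by the flow-matching action expert.
target = np.array([0.5, -0.2, 0.1])      # hypothetical 3-DoF action
def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(3)           # start from Gaussian noise
action = euler_flow_integrate(toy_velocity, noise, num_steps=10)
```

With this particular field, the Euler scheme recovers the target action exactly after the final step; a learned velocity network would instead produce a sample from the action distribution implied by training data.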
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info