Bringing VLA models to i.MX 95 embedded platforms — dataset recording, ACT/SmolVLA fine-tuning, and on-device optimization
AI Impact Summary
The document outlines bringing Vision-Language-Action (VLA) models to embedded robotics hardware, using asynchronous inference to keep execution latency within the control loop, which is critical for real-time motion correction. It provides concrete data-collection and fine-tuning guidance for ACT and SmolVLA, including a 120-episode dataset recorded with three cameras at 640x480, 30 fps, and best practices for robust evaluation. It also details on-device optimization for the NXP i.MX 95: decomposing the VLA graph into encoders, decoders, and an action expert, then applying per-block quantization to balance latency and accuracy on edge hardware. Illustrative sketches of the asynchronous-inference pattern and the per-block quantization step follow.
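A minimal sketch of the asynchronous-inference pattern the summary refers to: a fixed-rate control loop consumes buffered action chunks while a background thread refills the buffer, so a slow VLA forward pass never stalls actuation. The rates, chunk size, and the `read_obs`/`predict_chunk` helpers are hypothetical stand-ins, not the document's actual API.

```python
# Sketch: decouple slow VLA inference from a fixed-rate control loop.
# CONTROL_HZ, CHUNK_SIZE, read_obs(), and predict_chunk() are assumptions.
import queue
import threading
import time

CONTROL_HZ = 30          # assumed control-loop rate
CHUNK_SIZE = 16          # assumed actions returned per inference call

action_queue: "queue.Queue[list[float]]" = queue.Queue(maxsize=2 * CHUNK_SIZE)

def read_obs() -> dict:
    """Placeholder: grab camera frames and joint states."""
    return {"ts": time.time()}

def predict_chunk(obs: dict) -> list[list[float]]:
    """Placeholder: one (slow) VLA forward pass returning CHUNK_SIZE actions."""
    time.sleep(0.2)  # simulate inference latency longer than one control tick
    return [[0.0] * 6 for _ in range(CHUNK_SIZE)]

def inference_worker() -> None:
    while True:
        if action_queue.qsize() < CHUNK_SIZE // 2:   # refill before starvation
            for action in predict_chunk(read_obs()):
                action_queue.put(action)
        else:
            time.sleep(1.0 / CONTROL_HZ)

threading.Thread(target=inference_worker, daemon=True).start()

# Fixed-rate loop: never blocks on the model; holds the last action if the
# buffer is momentarily empty.
last_action = [0.0] * 6
for _ in range(300):
    tick = time.time()
    try:
        last_action = action_queue.get_nowait()
    except queue.Empty:
        pass  # reuse previous action this tick
    # send `last_action` to the actuators here
    time.sleep(max(0.0, 1.0 / CONTROL_HZ - (time.time() - tick)))
```

For the decomposition and per-block quantization step, one plausible realization (under stated assumptions, not the document's toolchain) is to split an exported ONNX graph at block boundaries and quantize each block independently, keeping accuracy-sensitive blocks in higher precision. The file names and boundary tensor names below are assumptions.

```python
# Sketch: split a VLA export into blocks, quantize per block.
# "vla_full.onnx" and the tensor names are assumed, not from the document.
from onnx.utils import extract_model
from onnxruntime.quantization import quantize_dynamic, QuantType

# Split the full graph at assumed encoder / action-expert boundaries.
extract_model("vla_full.onnx", "encoder.onnx",
              input_names=["pixel_values"], output_names=["vision_tokens"])
extract_model("vla_full.onnx", "action_expert.onnx",
              input_names=["vision_tokens", "state"], output_names=["actions"])

# Quantize the heavy encoder to int8 for latency; leave the small action
# expert in float in this sketch to protect action accuracy.
quantize_dynamic("encoder.onnx", "encoder_int8.onnx",
                 weight_type=QuantType.QInt8)
```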
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info