Bringing VLA models to i.MX 95 embedded platforms — dataset recording, ACT/SmolVLA fine-tuning, and on-device optimization
AI Impact Summary
The document outlines bringing Vision-Language-Action (VLA) models to embedded robotics hardware, using asynchronous inference to keep execution latency within the control loop, which is critical for real-time motion correction. It provides concrete data-collection and fine-tuning guidance for ACT and SmolVLA, including a 120-episode dataset recorded with three cameras at 640x480, 30 fps, and best practices for robust evaluation. It also details on-device optimization for the NXP i.MX 95: decomposing the VLA graph into encoders, decoders, and an action expert, then applying per-block quantization to balance latency and accuracy on edge hardware. Illustrative sketches of the asynchronous-inference pattern and the per-block quantization step follow.
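A minimal sketch of the asynchronous-inference pattern the summary refers to: a fixed-rate control loop consumes buffered action chunks while a background thread refills the buffer, so a slow VLA forward pass never stalls actuation. The rates, chunk size, and the `read_obs`/`predict_chunk` helpers are hypothetical stand-ins, not the document's actual API.

```python
# Sketch: decouple slow VLA inference from a fixed-rate control loop.
# CONTROL_HZ, CHUNK_SIZE, read_obs(), and predict_chunk() are assumptions.
import queue
import threading
import time

CONTROL_HZ = 30          # assumed control-loop rate
CHUNK_SIZE = 16          # assumed actions returned per inference call

action_queue: "queue.Queue[list[float]]" = queue.Queue(maxsize=2 * CHUNK_SIZE)

def read_obs() -> dict:
    """Placeholder: grab camera frames and joint states."""
    return {"ts": time.time()}

def predict_chunk(obs: dict) -> list[list[float]]:
    """Placeholder: one (slow) VLA forward pass returning CHUNK_SIZE actions."""
    time.sleep(0.2)  # simulate inference latency longer than one control tick
    return [[0.0] * 6 for _ in range(CHUNK_SIZE)]

def inference_worker() -> None:
    while True:
        if action_queue.qsize() < CHUNK_SIZE // 2:   # refill before starvation
            for action in predict_chunk(read_obs()):
                action_queue.put(action)
        else:
            time.sleep(1.0 / CONTROL_HZ)

threading.Thread(target=inference_worker, daemon=True).start()

# Fixed-rate loop: never blocks on the model; holds the last action if the
# buffer is momentarily empty.
last_action = [0.0] * 6
for _ in range(300):
    tick = time.time()
    try:
        last_action = action_queue.get_nowait()
    except queue.Empty:
        pass  # reuse previous action this tick
    # send `last_action` to the actuators here
    time.sleep(max(0.0, 1.0 / CONTROL_HZ - (time.time() - tick)))
```

For the decomposition and per-block quantization step, one plausible realization (under stated assumptions, not the document's toolchain) is to split an exported ONNX graph at block boundaries and quantize each block independently, keeping accuracy-sensitive blocks in higher precision. The file names and boundary tensor names below are assumptions.

```python
# Sketch: split a VLA export into blocks, quantize per block.
# "vla_full.onnx" and the tensor names are assumed, not from the document.
from onnx.utils import extract_model
from onnxruntime.quantization import quantize_dynamic, QuantType

# Split the full graph at assumed encoder / action-expert boundaries.
extract_model("vla_full.onnx", "encoder.onnx",
              input_names=["pixel_values"], output_names=["vision_tokens"])
extract_model("vla_full.onnx", "action_expert.onnx",
              input_names=["vision_tokens", "state"], output_names=["actions"])

# Quantize the heavy encoder to int8 for latency; leave the small action
# expert in float in this sketch to protect action accuracy.
quantize_dynamic("encoder.onnx", "encoder_int8.onnx",
                 weight_type=QuantType.QInt8)
```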
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info