Running Mistral 7B on-device with Core ML (WWDC 24)
AI Impact Summary
WWDC 24 demonstrated on-device LLM inference on Apple Silicon using Core ML, enabling private, low-latency AI within apps. The workflow uses a preview branch of swift-transformers to port Mistral 7B to Core ML, employing new APIs such as MLTensor and stateful buffers (which persist the KV cache across decoding steps) together with block-wise 4-bit quantization so the model fits in roughly 4 GB of RAM on a Mac. The article outlines concrete reproduction steps: clone the preview branch of swift-transformers, download the converted Core ML model from Hugging Face, and run inference from Swift, indicating a practical on-device path for 7B-scale models on macOS 15 / iOS 18 and later. This capability paves the way for privacy-preserving, offline AI features in consumer apps, but it requires Apple Silicon hardware and updated Core ML tooling across the stack.
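As a rough illustration of the stateful decoding loop described above, the sketch below uses Core ML's 2024 stateful-prediction API (MLState) to keep the KV cache alive between token steps. The package file name and the "inputIds" feature name are assumptions for illustration only; the actual converted Mistral model defines its own input/output names, and the swift-transformers preview branch wraps this loop behind a higher-level generation API.

```swift
import Foundation
import CoreML

// Minimal sketch of stateful decoding with Core ML (macOS 15 / iOS 18+).
// Assumed names: the converted package "StatefulMistral7BInstructInt4.mlpackage"
// and an "inputIds" input feature; substitute the names your model declares.
let packageURL = URL(fileURLWithPath: "StatefulMistral7BInstructInt4.mlpackage")
let compiledURL = try await MLModel.compileModel(at: packageURL)

let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU  // 7B-scale models run best on the GPU on Macs
let model = try MLModel(contentsOf: compiledURL, configuration: config)

// A single MLState owns the KV-cache buffers; reusing it across calls means
// each step only processes the newest token rather than the whole prompt.
let state = model.makeState()

func decodeStep(tokenID: Int32) throws -> any MLFeatureProvider {
    let ids = try MLMultiArray(shape: [1, 1], dataType: .int32)
    ids[0] = NSNumber(value: tokenID)
    let input = try MLDictionaryFeatureProvider(dictionary: ["inputIds": ids])
    // `using: state` tells Core ML to read and update the persistent KV cache.
    return try model.prediction(from: input, using: state)
}
```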
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info