WWDC 24: Running Mistral 7B on-device with Core ML
AI Impact Summary
WWDC 24 showcases on-device LLM deployment by running Mistral 7B through Core ML on Apple Silicon, enabling private, low-latency inference on iOS 18 and macOS Sequoia. It highlights a full toolchain: a fork of swift-transformers, converted Core ML models, and a Swift CLI for running inference, plus new Core ML capabilities, MLTensor and stateful buffers, that hold the key-value (KV) cache in place and reduce memory-bandwidth pressure for 7B-class models. For technical teams, this indicates a viable on-device path for mid-sized LLMs, with memory footprints around 4 GB on Mac hardware and opportunities to cut cloud costs and latency, provided you manage model conversion, 4-bit quantization, and Core ML tooling compatibility. Migration considerations include adopting the Core ML conversion workflow, using the new Swift tensor APIs, and targeting the Apple Silicon compute units (CPU, GPU, and Neural Engine) on iOS 18 and macOS Sequoia.
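As a rough illustration of how the stateful KV-cache flow fits together at the Core ML API level, the Swift sketch below runs greedy token-by-token decoding against a converted model. The model path (Mistral7B.mlmodelc) and feature names (inputIds, logits) are hypothetical placeholders, not names taken from the session; the state APIs shown (makeState(), prediction(from:using:)) are the ones Core ML introduces with iOS 18 and macOS Sequoia.

```swift
import Foundation
import CoreML

// Minimal sketch of stateful decoding with Core ML on Apple Silicon.
// Assumes a converted, 4-bit-quantized Mistral 7B package; the file name
// ("Mistral7B.mlmodelc") and feature names ("inputIds", "logits") are
// hypothetical and depend on how the model was converted.
@available(macOS 15.0, iOS 18.0, *)
func generate(promptTokens: [Int32], maxNewTokens: Int) throws -> [Int32] {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndGPU  // 7B-class models typically run on the GPU on Mac

    let url = URL(fileURLWithPath: "Mistral7B.mlmodelc")  // hypothetical path
    let model = try MLModel(contentsOf: url, configuration: config)

    // A stateful model keeps its KV cache inside a state buffer, so after
    // the prompt is processed each prediction feeds only the newest token.
    let state = model.makeState()

    var tokens = promptTokens
    var nextInput = promptTokens  // first step: the whole prompt

    for _ in 0..<maxNewTokens {
        let inputIds = try MLMultiArray(
            shape: [1, NSNumber(value: nextInput.count)], dataType: .int32)
        for (i, t) in nextInput.enumerated() {
            inputIds[i] = NSNumber(value: t)
        }
        let input = try MLDictionaryFeatureProvider(
            dictionary: ["inputIds": MLFeatureValue(multiArray: inputIds)])

        // prediction(from:using:) threads the KV-cache state through the call.
        let output = try model.prediction(from: input, using: state)
        guard let logits = output.featureValue(for: "logits")?.multiArrayValue else {
            break
        }

        // Greedy decoding: argmax over the last position's logits, assuming a
        // contiguous [1, seqLen, vocabSize] layout.
        let vocabSize = logits.shape.last!.intValue
        let offset = logits.count - vocabSize
        var best: (token: Int32, score: Float) = (0, -.infinity)
        for v in 0..<vocabSize {
            let score = logits[offset + v].floatValue
            if score > best.score { best = (Int32(v), score) }
        }
        tokens.append(best.token)
        nextInput = [best.token]  // only the new token; the state holds the rest
    }
    return tokens
}
```

Because the KV cache lives in the model's state buffer rather than being passed in and out as an I/O tensor, each decoding step transfers only the new token and the logits, which is what relieves the memory-bandwidth pressure the summary mentions.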
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info