OpenVINO GenAI with Optimum-Intel for edge deployment of Hugging Face Transformers
AI Impact Summary
The article describes a capability to optimize and deploy Hugging Face Transformers models using Optimum-Intel and OpenVINO GenAI for edge/client scenarios. It covers exporting models to OpenVINO IR, applying weight-only quantization (INT8/INT4) via NNCF with AWQ, and deploying through OpenVINO GenAI's Python and C++ APIs (LLMPipeline), targeting Intel hardware. This enables lower latency and a smaller deployment footprint at the edge, but requires migrating model export, quantization, and deployment pipelines to the OpenVINO GenAI workflow.
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info