ONNX Runtime accelerates 130k Hugging Face models with up to 74% latency improvement
AI Impact Summary
ONNX Runtime accelerates a large portion of Hugging Face models by providing a cross-platform inference backend. The platform currently supports over 90 architectures and more than 130,000 ONNX-exportable models, including BERT, GPT2, DistilBERT, RoBERTa, T5, Wav2Vec2, Whisper, and Stable Diffusion, expanding the available deployment options. Inference with Whisper-tiny demonstrates a latency reduction of about 74.3% versus PyTorch, illustrating substantial cost and throughput benefits at scale. Enterprises should verify that their models export cleanly to ONNX and benchmark critical workloads to realize these gains in production.
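Benchmarking before and after conversion is the practical way to confirm gains like the Whisper-tiny figure above. The sketch below is a minimal, hypothetical latency harness (the helper name `benchmark_latency` and its parameters are illustrative, not from any library): you would pass it one callable wrapping your PyTorch pipeline and another wrapping the ONNX Runtime session, then compare the reported statistics.

```python
import statistics
import time

def benchmark_latency(infer_fn, warmup=3, runs=20):
    """Time a zero-argument inference callable; return latency stats in ms.

    infer_fn: callable that performs one end-to-end inference, e.g. a
    lambda closing over a PyTorch model or an ONNX Runtime session.
    """
    # Warm-up iterations are excluded so one-time costs (JIT, graph
    # optimization, cache population) do not skew the measurements.
    for _ in range(warmup):
        infer_fn()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer_fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(samples),
        "p50_ms": statistics.median(samples),
        "max_ms": max(samples),
    }
```

Running this against both backends on the same inputs and comparing medians (not single runs) gives a defensible per-workload latency delta before committing to a production migration.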
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info