ONNX Runtime accelerates 130k Hugging Face models with up to 74% latency improvement
AI Impact Summary
ONNX Runtime accelerates a large portion of Hugging Face models by providing a cross-platform inference backend. The platform currently supports over 90 architectures and more than 130,000 ONNX-exportable models, including BERT, GPT2, DistilBERT, RoBERTa, T5, Wav2Vec2, Whisper, and Stable Diffusion, expanding the available deployment options. Inference with Whisper-tiny demonstrates a latency reduction of about 74.3% versus PyTorch, illustrating substantial cost and throughput benefits at scale. Enterprises should verify that their models export cleanly to ONNX and benchmark critical workloads to realize these gains in production.
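Benchmarking before and after conversion is the practical way to confirm gains like the Whisper-tiny figure above. The sketch below is a minimal, hypothetical latency harness (the helper name `benchmark_latency` and its parameters are illustrative, not from any library): you would pass it one callable wrapping your PyTorch pipeline and another wrapping the ONNX Runtime session, then compare the reported statistics.

```python
import statistics
import time

def benchmark_latency(infer_fn, warmup=3, runs=20):
    """Time a zero-argument inference callable; return latency stats in ms.

    infer_fn: callable that performs one end-to-end inference, e.g. a
    lambda closing over a PyTorch model or an ONNX Runtime session.
    """
    # Warm-up iterations are excluded so one-time costs (JIT, graph
    # optimization, cache population) do not skew the measurements.
    for _ in range(warmup):
        infer_fn()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer_fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(samples),
        "p50_ms": statistics.median(samples),
        "max_ms": max(samples),
    }
```

Running this against both backends on the same inputs and comparing medians (not single runs) gives a defensible per-workload latency delta before committing to a production migration.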
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info