AWS Inferentia2 accelerates Hugging Face Transformers on Inf2 instances
AI Impact Summary
Hugging Face and AWS have integrated Inferentia2 to run Hugging Face Transformers with significantly higher throughput and lower latency. Benchmark data indicates Inf2-based deployments achieve roughly 4x lower p95 latency than both Inferentia1 and NVIDIA A10G GPUs, and instance sizes from inf2.xlarge up to multi-chip Inf2 configurations support models as large as 175B parameters. The integration leverages the AWS Neuron SDK and requires only a minimal code change (a single-line compile), reducing deployment complexity for production inference of models such as BERT, RoBERTa, ViT, and BLOOM.
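The single-line compile mentioned above can be sketched with the AWS Neuron SDK's `torch_neuronx.trace` call. This is a hedged illustration, not the exact code from the integration: the checkpoint name, input shapes, and task head are illustrative, and running it requires an Inf2 (or Trn1) instance with the Neuron SDK installed.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch_neuronx  # AWS Neuron SDK; available only on Neuron-equipped instances

# Load a standard Hugging Face checkpoint (checkpoint name is illustrative).
# torchscript=True makes the model return tuples, which tracing requires.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", torchscript=True
)
model.eval()

# Example inputs fix the tensor shapes the compiled graph will accept.
inputs = tokenizer(
    "Inferentia2 example", padding="max_length",
    max_length=128, return_tensors="pt",
)
example = (inputs["input_ids"], inputs["attention_mask"])

# The single-line compile: trace the model for the Neuron accelerator.
model_neuron = torch_neuronx.trace(model, example)

# The compiled module is then used like an ordinary torch module.
with torch.no_grad():
    logits = model_neuron(*example)
```

Because the traced graph is shape-specialized, production deployments typically pad inputs to the fixed `max_length` used at compile time.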
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info