AWS Inferentia2 accelerates Hugging Face Transformers on Inf2 instances
AI Impact Summary
Hugging Face and AWS have integrated Inferentia2 to run Hugging Face Transformers with significantly higher throughput and lower latency. Benchmark data indicates Inf2-based deployments achieve roughly 4x lower p95 latency than both Inferentia1 and NVIDIA A10G GPUs, and instance sizes from inf2.xlarge up to multi-chip Inf2 configurations support models as large as 175B parameters. The integration leverages the AWS Neuron SDK and requires only a minimal code change (a single-line compile), reducing deployment complexity for production inference of models such as BERT, RoBERTa, ViT, and BLOOM.
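The single-line compile mentioned above can be sketched with the AWS Neuron SDK's `torch_neuronx.trace` call. This is a hedged illustration, not the exact code from the integration: the checkpoint name, input shapes, and task head are illustrative, and running it requires an Inf2 (or Trn1) instance with the Neuron SDK installed.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch_neuronx  # AWS Neuron SDK; available only on Neuron-equipped instances

# Load a standard Hugging Face checkpoint (checkpoint name is illustrative).
# torchscript=True makes the model return tuples, which tracing requires.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", torchscript=True
)
model.eval()

# Example inputs fix the tensor shapes the compiled graph will accept.
inputs = tokenizer(
    "Inferentia2 example", padding="max_length",
    max_length=128, return_tensors="pt",
)
example = (inputs["input_ids"], inputs["attention_mask"])

# The single-line compile: trace the model for the Neuron accelerator.
model_neuron = torch_neuronx.trace(model, example)

# The compiled module is then used like an ordinary torch module.
with torch.no_grad():
    logits = model_neuron(*example)
```

Because the traced graph is shape-specialized, production deployments typically pad inputs to the fixed `max_length` used at compile time.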
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info