Deploy models on AWS Inferentia2 via Hugging Face Inference Endpoints
AI Impact Summary
Hugging Face and AWS now support deploying Hugging Face models on AWS Inferentia2 through both SageMaker and Hugging Face Inference Endpoints, extending hardware acceleration to over 100k public HF models across 14 architectures. Inf2 instances running Text Generation Inference (TGI) offer scalable, cost-efficient LLM inference with per-second billing and scale-to-zero, in entry-level and high-end flavors (Inf2-small and Inf2-xlarge) suitable for Llama 3 workloads. Because the integration builds on Neuron/Neuronx acceleration and exposes an OpenAI SDK-compatible Messages API, it lowers friction for production deployments and shortens time-to-market for large-scale inference workloads. The business impact centers on lower cost-per-request and improved latency for LLM deployments, along with simpler per-endpoint scaling and broader model support in production environments.
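Because TGI exposes an OpenAI-compatible Messages API, a deployed Inf2 endpoint can be called with a standard chat-completions request. The sketch below builds such a request without sending it; the endpoint URL and token are hypothetical placeholders, and "tgi" is the conventional model name TGI accepts for its single served model.

```python
import json

# Hypothetical placeholders -- substitute your own endpoint URL and HF token.
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = "hf_xxx"

def build_chat_request(messages, max_tokens=256):
    """Build an OpenAI-style chat-completions request for a TGI endpoint.

    TGI serves /v1/chat/completions, so the payload mirrors the OpenAI
    chat schema; any HTTP client (or the OpenAI SDK pointed at this
    base URL) can send it.
    """
    url = f"{ENDPOINT_URL}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": "tgi",          # placeholder name for the single served model
        "messages": messages,    # OpenAI-format role/content pairs
        "max_tokens": max_tokens,
    }
    return url, headers, json.dumps(payload)

url, headers, body = build_chat_request(
    [{"role": "user", "content": "Why is Inferentia2 cost-efficient?"}]
)
```

The same endpoint also works with the OpenAI Python SDK by setting `base_url` to `ENDPOINT_URL + "/v1/"` and passing the HF token as the API key.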
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info