Deploy Meta Llama 3.1 405B on Google Cloud Vertex AI with Text Generation Inference
AI Impact Summary
Meta Llama 3.1 405B can be deployed on Vertex AI using Text Generation Inference (TGI) with the Hugging Face Deep Learning Containers (DLCs). Given the model's 405B-parameter scale, serving requires either FP8 quantization on a single node or a multi-node configuration. Successful deployment hinges on obtaining Hugging Face access to the gated meta-llama model, selecting the correct TGI DLC (v2.2) container, and provisioning substantial GPU quota (8x H100 GPUs on A3 instances) in the target region. Deployment also requires attention to zone availability and memory requirements (roughly 810 GB of VRAM for FP16 weights, 405 GB for FP8) to avoid runtime constraints in production.
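The memory figures above can be sanity-checked with a back-of-the-envelope sketch. This is illustrative only: the byte-per-parameter figures cover weights alone (no KV cache or activation overhead), and the 8x 80 GB figure assumes the a3-highgpu-8g instance shape mentioned above.

```python
# Rough VRAM estimate for serving Llama 3.1 405B weights at different precisions.
PARAMS = 405e9  # 405B parameters

BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}

def weights_vram_gb(precision: str) -> float:
    """Approximate VRAM needed just to hold the model weights, in GB."""
    return PARAMS * BYTES_PER_PARAM[precision] / 1e9

# One A3 instance: 8x H100 GPUs with 80 GB of HBM each.
A3_VRAM_GB = 8 * 80  # 640 GB total

for precision in ("fp16", "fp8"):
    need = weights_vram_gb(precision)
    fits = need <= A3_VRAM_GB
    print(f"{precision}: ~{need:.0f} GB for weights; fits on one A3 node: {fits}")
```

This shows why the two deployment modes exist: FP16 weights (~810 GB) exceed a single A3 node's 640 GB, forcing multi-node serving, while FP8 (~405 GB) fits on one node with headroom left for the KV cache.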
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium