Deploy Meta Llama 3.1 405B on Google Cloud Vertex AI with Text Generation Inference
AI Impact Summary
Meta Llama 3.1 405B can be deployed on Vertex AI using Text Generation Inference (TGI) with the Hugging Face Deep Learning Containers (DLCs). Given the model's 405B-parameter scale, serving requires either FP8 quantization on a single node or a multi-node configuration. Successful deployment hinges on obtaining Hugging Face access to the gated meta-llama model, selecting the correct TGI DLC (v2.2) container, and provisioning substantial GPU quota (8x H100 GPUs on A3 instances) in the target region. Deployment also requires attention to zone availability and memory requirements (roughly 810 GB of VRAM for FP16 weights, 405 GB for FP8) to avoid runtime constraints in production.
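The memory figures above can be sanity-checked with a back-of-the-envelope sketch. This is illustrative only: the byte-per-parameter figures cover weights alone (no KV cache or activation overhead), and the 8x 80 GB figure assumes the a3-highgpu-8g instance shape mentioned above.

```python
# Rough VRAM estimate for serving Llama 3.1 405B weights at different precisions.
PARAMS = 405e9  # 405B parameters

BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}

def weights_vram_gb(precision: str) -> float:
    """Approximate VRAM needed just to hold the model weights, in GB."""
    return PARAMS * BYTES_PER_PARAM[precision] / 1e9

# One A3 instance: 8x H100 GPUs with 80 GB of HBM each.
A3_VRAM_GB = 8 * 80  # 640 GB total

for precision in ("fp16", "fp8"):
    need = weights_vram_gb(precision)
    fits = need <= A3_VRAM_GB
    print(f"{precision}: ~{need:.0f} GB for weights; fits on one A3 node: {fits}")
```

This shows why the two deployment modes exist: FP16 weights (~810 GB) exceed a single A3 node's 640 GB, forcing multi-node serving, while FP8 (~405 GB) fits on one node with headroom left for the KV cache.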
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium