Custom AMD MI300X kernels accelerate Llama 3.1 405B inference in vLLM
AI Impact Summary
The post describes open-source, AMD-optimized kernels for the MI300X that accelerate Llama 3.1 405B inference in FP8 on an 8-GPU node under vLLM: a fused residual-add + RMS norm + FP8-conversion kernel, a fused SwiGLU kernel with FP8 output, and a Skinny GEMM kernel. It offers a concrete migration path through the hf-rocm-kernels repository, which provides Python bindings and examples for binding CUDA-style kernels, with integration into AMD vLLM planned. For teams operating large-scale chat deployments, adopting these kernels could materially reduce decoding latency and per-token energy use, lowering total cost of ownership for inference at scale on MI300X-based infrastructure.
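To make the first fusion concrete, below is a minimal unfused PyTorch sketch of what the residual-add + RMS norm + FP8-conversion kernel computes. The function name, signature, and the choice of the e4m3 FP8 format are assumptions for illustration only, not the repo's actual API (MI300X natively uses the fnuz FP8 variants).

```python
import torch

def rmsnorm_fp8_reference(x: torch.Tensor,
                          residual: torch.Tensor,
                          weight: torch.Tensor,
                          scale: torch.Tensor,
                          eps: float = 1e-6):
    """Unfused reference for the fused kernel's semantics (hypothetical API).

    The fused MI300X kernel performs all three steps in one pass over the
    activations instead of three separate reads and writes.
    """
    # 1) Residual addition; the result also feeds the next residual branch.
    hidden = x + residual
    # 2) RMS normalization: x * rsqrt(mean(x^2) + eps) * weight,
    #    accumulated in float32 for numerical stability.
    variance = hidden.to(torch.float32).pow(2).mean(-1, keepdim=True)
    normed = hidden * torch.rsqrt(variance + eps) * weight
    # 3) Scale, saturate to the FP8 range, and cast for the next FP8 GEMM.
    #    e4m3fn is used here for portability; the hardware variant differs.
    finfo = torch.finfo(torch.float8_e4m3fn)
    fp8 = (normed / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return fp8, hidden
```

The motivation for fusing these steps is that each one is memory-bandwidth-bound, so a single kernel that reads and writes the activations once can substantially outperform three back-to-back launches during decode.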
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info