Custom AMD MI300X kernels accelerate Llama 3.1 405B inference in vLLM
AI Impact Summary
The post describes open-source, AMD-optimized kernels for the MI300X that accelerate Llama 3.1 405B inference in FP8 on an 8-GPU node under vLLM: a fused residual-add + RMS norm + FP8-conversion kernel, a fused SwiGLU kernel with FP8 output, and a Skinny GEMM kernel. It offers a concrete migration path through the hf-rocm-kernels repository, which provides Python bindings and examples for binding CUDA-style kernels, with integration into AMD vLLM planned. For teams operating large-scale chat deployments, adopting these kernels could materially reduce decoding latency and per-token energy use, lowering total cost of ownership for inference at scale on MI300X-based infrastructure.
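To make the first fusion concrete, below is a minimal unfused PyTorch sketch of what the residual-add + RMS norm + FP8-conversion kernel computes. The function name, signature, and the choice of the e4m3 FP8 format are assumptions for illustration only, not the repo's actual API (MI300X natively uses the fnuz FP8 variants).

```python
import torch

def rmsnorm_fp8_reference(x: torch.Tensor,
                          residual: torch.Tensor,
                          weight: torch.Tensor,
                          scale: torch.Tensor,
                          eps: float = 1e-6):
    """Unfused reference for the fused kernel's semantics (hypothetical API).

    The fused MI300X kernel performs all three steps in one pass over the
    activations instead of three separate reads and writes.
    """
    # 1) Residual addition; the result also feeds the next residual branch.
    hidden = x + residual
    # 2) RMS normalization: x * rsqrt(mean(x^2) + eps) * weight,
    #    accumulated in float32 for numerical stability.
    variance = hidden.to(torch.float32).pow(2).mean(-1, keepdim=True)
    normed = hidden * torch.rsqrt(variance + eps) * weight
    # 3) Scale, saturate to the FP8 range, and cast for the next FP8 GEMM.
    #    e4m3fn is used here for portability; the hardware variant differs.
    finfo = torch.finfo(torch.float8_e4m3fn)
    fp8 = (normed / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return fp8, hidden
```

The motivation for fusing these steps is that each one is memory-bandwidth-bound, so a single kernel that reads and writes the activations once can substantially outperform three back-to-back launches during decode.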
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info