AMD MI300X Custom Kernels: Llama 3.1 405B FP8 Optimization
AI Impact Summary
Custom kernels for the AMD MI300X GPU target faster inference for large language models such as Llama 3.1 405B in FP8. The work tunes a kernel that fuses the residual connection, RMS norm, and FP8 conversion, along with kernels for GEMM and the SwiGLU activation, to speed up serving with vLLM. The kernels are designed around the MI300X's architecture, including its compute units and XCDs (accelerator complex dies) and the way thread blocks map onto them, to maximize throughput and efficiency.
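To make the fused step concrete, here is a minimal pure-Python reference sketch of what a combined residual connection, RMS norm, and FP8 conversion computes for one row of activations. It is not the kernel described in the summary: the function names, the per-tensor `scale` parameter, the `eps` value, and the FP8 maximum of 240.0 (the range of the e4m3fnuz format) are all assumptions made for illustration, and the rounding helper only approximates FP8 E4M3 by keeping four significant bits and clamping, ignoring subnormals and the exact bit encoding.

```python
import math

# Assumed maximum magnitude for the FP8 format (e4m3fnuz); the OCP e4m3fn
# variant would use 448.0 instead. This is an illustrative assumption.
FP8_MAX = 240.0


def round_to_fp8_grid(v, fp8_max=FP8_MAX):
    """Approximate FP8 E4M3 rounding: keep 4 significant bits, then clamp.

    Hypothetical helper; ignores subnormals and the exact bit encoding.
    """
    if v == 0.0:
        return 0.0
    m, e = math.frexp(v)                   # v = m * 2**e with 0.5 <= |m| < 1
    q = math.ldexp(round(m * 16) / 16, e)  # 1 implicit + 3 explicit mantissa bits
    return max(-fp8_max, min(fp8_max, q))


def fused_add_rmsnorm_fp8(x, residual, weight, scale, eps=1e-6):
    """Reference for one row: residual add, RMS norm, scaled FP8 conversion.

    Returns the quantized values and the updated residual stream.
    """
    # 1) Residual connection: add the incoming residual to the activations.
    h = [a + b for a, b in zip(x, residual)]
    # 2) RMS norm: divide by the root mean square, then apply the learned weight.
    rms = math.sqrt(sum(v * v for v in h) / len(h) + eps)
    normed = [v / rms * w for v, w in zip(h, weight)]
    # 3) FP8 conversion: divide by the scale, then snap to the FP8 grid.
    q = [round_to_fp8_grid(v / scale) for v in normed]
    return q, h


if __name__ == "__main__":
    q, h = fused_add_rmsnorm_fp8(
        x=[1.0, 2.0, 3.0, 4.0],
        residual=[0.0, 0.0, 0.0, 0.0],
        weight=[1.0, 1.0, 1.0, 1.0],
        scale=1.0,
    )
    print("quantized:", q)
    print("residual: ", h)
```

The likely benefit of fusing these three steps on a GPU is that the activations are read from memory once and written once, rather than making a round trip for each operation; the sketch shows only the arithmetic, not that memory behavior.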
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info