Modular: Matrix Multiplication on Blackwell: Part 4 - Breaking SOTA (1772 TFLOPs)
AI Impact Summary
Modular reports a 15% performance improvement in matrix multiplication on NVIDIA Blackwell from Cluster Launch Control (CLC) scheduling, reaching 1772 TFLOPs. The implementation is a persistent kernel that eliminates shared memory overhead and barrier restart issues. The key innovation is a producer-consumer model orchestrated by the GPU's hardware, which assigns work tiles to idle streaming multiprocessors (SMs). This reduces scheduling latency and lets Tensor Memory Accelerator (TMA) loads overlap with matrix multiply-accumulate (MMA) operations.
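The persistent-kernel idea can be illustrated with a small CPU-side analogy. The sketch below is not the CLC API and not the kernel from the post: it is a minimal Python model in which a fixed pool of long-lived worker threads (standing in for SMs) repeatedly asks a shared scheduler for the next output tile instead of being relaunched per tile. The function name `persistent_tiled_matmul`, its parameters, and the lock-protected counter that stands in for hardware tile assignment are all assumptions made for illustration. It does not model the TMA/MMA overlap.

```python
import threading
from itertools import count


def persistent_tiled_matmul(a, b, tile=2, num_workers=4):
    """Multiply a (m x k) by b (k x n) using persistent workers that pull tiles.

    Hypothetical illustration: each worker thread plays the role of an SM, and
    the shared counter plays the role of the hardware scheduler handing out tiles.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]

    # Enumerate output tiles by their top-left corner.
    tiles = [(i, j) for i in range(0, m, tile) for j in range(0, n, tile)]

    next_tile = count()            # monotonically increasing tile index
    lock = threading.Lock()        # serializes tile assignment
    tiles_done = [0] * num_workers

    def worker(worker_id):
        # The worker stays alive and keeps requesting work until none is left.
        while True:
            with lock:
                idx = next(next_tile)
            if idx >= len(tiles):
                return
            i0, j0 = tiles[idx]
            # Each tile is owned by exactly one worker, so writes never collide.
            for i in range(i0, min(i0 + tile, m)):
                for j in range(j0, min(j0 + tile, n)):
                    c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
            tiles_done[worker_id] += 1

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return c, tiles_done
```

Calling `persistent_tiled_matmul(a, b, tile=2, num_workers=3)` returns the product together with a per-worker count of tiles processed. How tiles are split among workers depends on thread timing, which mirrors the point of dynamic scheduling: whichever worker is idle takes the next tile.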
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info