Modular: Matrix Multiplication on Blackwell: Part 4 - Breaking SOTA (1772 TFLOPs)

AI Impact Summary

Modular reports a 15% performance improvement in matrix multiplication on NVIDIA Blackwell, reaching 1772 TFLOPs, by adopting Cluster Launch Control (CLC) scheduling. The implementation is a persistent kernel that eliminates shared memory overhead and barrier restart issues. Its core is a producer-consumer model orchestrated by the GPU hardware: the hardware assigns work tiles to idle SMs, which reduces scheduling latency and lets TMA loads overlap with MMA operations.
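
The persistent-worker idea in the summary can be illustrated with a CPU-side analogy. The sketch below is not CLC and does not model TMA/MMA overlap; it only shows long-lived workers (standing in for SMs) repeatedly claiming the next output tile from a shared scheduler until none remain. The names `persistent_matmul`, `fetch_tile`, and `worker` are hypothetical, and a lock-protected counter stands in for the hardware scheduler.

```python
import threading


def persistent_matmul(A, B, tile=2, num_workers=4):
    """Compute C = A @ B with persistent workers that pull tiles on demand.

    Assumption: a lock-protected counter plays the role of the scheduler;
    on Blackwell, tile assignment is done by the hardware instead.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    # Enumerate the output tiles by their top-left corner.
    tiles = [(i, j) for i in range(0, M, tile) for j in range(0, N, tile)]
    next_tile = 0
    lock = threading.Lock()
    claimed = []  # log of (worker_id, tile) pairs, for inspection only

    def fetch_tile(worker_id):
        # Hand out the next unclaimed tile, or None when work is exhausted.
        nonlocal next_tile
        with lock:
            if next_tile >= len(tiles):
                return None
            t = tiles[next_tile]
            next_tile += 1
            claimed.append((worker_id, t))
            return t

    def worker(worker_id):
        # A persistent worker: launched once, loops until no tiles remain.
        while True:
            t = fetch_tile(worker_id)
            if t is None:
                return
            i0, j0 = t
            for i in range(i0, min(i0 + tile, M)):
                for j in range(j0, min(j0 + tile, N)):
                    C[i][j] = sum(A[i][k] * B[k][j] for k in range(K))

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C, claimed
```

Because each tile is claimed exactly once and tiles write disjoint regions of `C`, the workers need no synchronization beyond the tile fetch. A worker that finishes early simply claims more tiles, which is the load-balancing property the summary attributes to dynamic scheduling.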

Affected Systems

NVIDIA Blackwell
Cluster Launch Control (CLC)

Date: not specified
Change type: capability
Severity: info