Modular MAX 25.2: Multi-GPU H200/H100 Support & CUDA-Free Inference
AI Impact Summary
Modular MAX 25.2 introduces multi-GPU support for NVIDIA H100 and H200 hardware, enabling deployment of large language models such as Llama-3.3-70B-Instruct across multiple GPUs. The release expands model support to over 500 preconfigured models and adds GPTQ quantization along with optimized LLM serving techniques (batch scheduling, in-flight batching, and copy-on-write KV blocks) to improve performance and reduce total cost of ownership (TCO). A slim Docker container (1.3 GB compressed) further accelerates deployment, eliminating CUDA dependencies and offering a simplified GPU programming experience with Mojo.
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info