OpenAI GPT-OSS integration with Transformers adds MXFP4 quantization and prebuilt kernels
AI Impact Summary
OpenAI has released the GPT-OSS models with MXFP4 quantization and a consolidated kernel ecosystem, and these capabilities are wired into the transformers workflow so that GPT-OSS 20B/120B can be loaded, run, and fine-tuned with prebuilt kernels from the Hub. The update introduces zero-build kernels, Flash Attention 3, and MoE-specific kernels (e.g., MegaBlocksMoeMLP and Liger RMSNorm), which can significantly reduce memory footprint and improve throughput on supported GPUs, provided you opt in via use_kernels. There are compatibility nuances: MXFP4 uses dedicated Triton kernels, while some kernel paths are incompatible with MXFP4 and may force inference to bf16. Teams should verify the model config (quant_method) and ensure the prerequisites (accelerate, kernels, and Triton >= 3.4) are in place before a production rollout.
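The verification step above can be sketched with two small helpers. This is a hedged sketch, not a confirmed API: the nested "quantization_config"/"quant_method" layout and the "mxfp4" string value are assumptions based on the usual shape of a checkpoint's config.json, and the 3.4 floor comes from the Triton prerequisite stated above.

```python
def uses_mxfp4(model_config: dict) -> bool:
    """Return True if a model's config declares MXFP4 quantization.

    Assumes the common transformers layout, where config.json nests a
    "quantization_config" dict carrying "quant_method" (assumed keys and
    value; check your actual checkpoint's config.json).
    """
    quant_config = model_config.get("quantization_config") or {}
    return quant_config.get("quant_method") == "mxfp4"


def triton_version_ok(version: str) -> bool:
    """Check the Triton >= 3.4 floor mentioned above (major.minor compare)."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= (3, 4)


# Example: a checkpoint whose config.json declares MXFP4
print(uses_mxfp4({"quantization_config": {"quant_method": "mxfp4"}}))  # True
print(triton_version_ok("3.4.0"))  # True
print(triton_version_ok("3.3.1"))  # False
```

In a rollout script, a check like this would gate whether to request the MXFP4 path (e.g., opting in via use_kernels when loading) or to plan for the bf16 fallback noted above.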
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info