OpenAI gpt-oss: New Kernel Optimizations for Faster Inference
Action Required
Users can now run the larger gpt-oss models on a single GPU, significantly improving inference speed and reducing hardware costs.
AI Impact Summary
OpenAI has released a set of optimizations for its gpt-oss models, primarily focused on accelerating inference through custom kernels. These optimizations include MXFP4 quantization (a 4-bit floating-point format), Flash Attention 3 with attention sinks, and community-contributed kernels from the Hugging Face Hub. Together they allow users to run larger models such as gpt-oss-20b and gpt-oss-120b on a single GPU, significantly improving performance and reducing memory requirements. The release also introduces a streamlined way to deploy quantized models directly to the Hugging Face Hub.
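As a rough, non-authoritative sketch of how these optimizations surface to end users, the snippet below loads a gpt-oss checkpoint with Hugging Face transformers and runs a short generation on a single GPU. The model identifier "openai/gpt-oss-20b" and the assumption that the MXFP4 and Flash Attention kernels are picked up automatically when installed are inferred from the summary above, not confirmed details of this release.

```python
# Minimal sketch: loading a gpt-oss checkpoint with transformers.
# Assumptions: the Hub ID "openai/gpt-oss-20b" exists and the optimized
# (MXFP4 / Flash Attention 3) kernels are installed on the host; this
# snippet itself only uses standard transformers APIs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native dtype where supported
    device_map="auto",    # place the model on the available GPU
)

messages = [{"role": "user", "content": "Summarize MXFP4 quantization in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```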
Affected Systems
- Date: not specified
- Change type: capability
- Severity: high