Optimum-NVIDIA on Hugging Face enables FP8-accelerated LLM inference on NVIDIA GPUs
AI Impact Summary
Optimum-NVIDIA on Hugging Face provides a one-line switch to enable FP8-quantized inference on NVIDIA GPUs, substantially increasing LLM throughput and reducing latency. It builds on TensorRT-LLM and the FP8 support in the Ada Lovelace and Hopper architectures to deliver up to 28x higher throughput and up to 3.3x faster time to first token for LLaMA-family models such as meta-llama/Llama-2-7b-chat-hf and meta-llama/Llama-2-13b-chat-hf. The update is exposed through the optimum.nvidia pipelines and AutoModelForCausalLM integrations, so teams can upgrade existing transformers-based pipelines with minimal code changes and realize significant performance gains in production inference.
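As a concrete illustration, the snippet below sketches the one-line switch described above using the optimum.nvidia pipelines entry point. It is a minimal sketch, assuming the optimum-nvidia package is installed on a host with an FP8-capable GPU (Ada Lovelace or Hopper); the model name is taken from the summary above.

```python
# Minimal sketch: swapping the transformers pipeline import for the
# optimum.nvidia equivalent enables TensorRT-LLM-backed inference.
# Assumes optimum-nvidia is installed and an FP8-capable NVIDIA GPU
# (Ada Lovelace or Hopper) is available.
from optimum.nvidia.pipelines import pipeline  # replaces transformers.pipelines

# use_fp8=True is the single flag that turns on FP8-quantized inference.
pipe = pipeline(
    "text-generation",
    "meta-llama/Llama-2-7b-chat-hf",
    use_fp8=True,
)

print(pipe("Summarize the benefits of FP8 inference in one sentence."))
```

For code that constructs models directly, the same switch applies at load time: replace `from transformers import AutoModelForCausalLM` with `from optimum.nvidia import AutoModelForCausalLM` and pass `use_fp8=True` to `from_pretrained`; the rest of the generation code is intended to stay unchanged.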
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info