Text Generation Inference adds Intel Gaudi backend for production LLM inference
AI Impact Summary
Text Generation Inference (TGI) now includes a native Gaudi backend integrated into the mainline codebase, eliminating the need for a separate Gaudi fork. This enables production-grade LLM inference on Intel Gaudi hardware with TGI features such as dynamic batching and streaming responses, plus FP8 quantization via Intel Neural Compressor. Supported models span Llama 3.1 and 3.3, Llama 3.2 Vision, Mistral, Mixtral, CodeLlama, Falcon, Qwen2, StarCoder, Gemma, LLaVA, and Phi-2, with multi-card sharding options. Deployment is simplified via the official Gaudi-enabled Docker image, broadening hardware options beyond GPUs and potentially improving cost per token for targeted workloads.
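Because the Gaudi backend lives in mainline TGI, clients talk to it the same way as to any other TGI endpoint. The sketch below streams tokens from a local server using huggingface_hub's InferenceClient; the endpoint URL, port, and prompt are placeholders, and it assumes a TGI server (e.g. launched from the Gaudi-enabled Docker image) is already listening there.

```python
# Minimal sketch: stream tokens from a running TGI endpoint.
# Assumes a TGI server is serving at http://localhost:8080;
# the URL and prompt below are placeholders, not part of the release notes.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# stream=True yields tokens as they are generated instead of
# waiting for the full completion (TGI's streaming responses).
for token in client.text_generation(
    "Explain dynamic batching in one sentence.",
    max_new_tokens=64,
    stream=True,
):
    print(token, end="", flush=True)
print()
```

The same client code works unchanged across TGI backends, which is a practical benefit of folding Gaudi support into the mainline codebase rather than a separate fork.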
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info