Text Generation Inference adds Intel Gaudi backend for production LLM inference
AI Impact Summary
Text Generation Inference (TGI) now includes a native Gaudi backend integrated into the mainline codebase, eliminating the need for a separate Gaudi fork. This enables production-grade LLM inference on Intel Gaudi hardware with TGI features such as dynamic batching and streaming responses, plus FP8 quantization via Intel Neural Compressor. Supported models span Llama 3.1 and 3.3, Llama 3.2 Vision, Mistral, Mixtral, CodeLlama, Falcon, Qwen2, StarCoder, Gemma, LLaVA, and Phi-2, with multi-card sharding options. Deployment is simplified via the official Gaudi-enabled Docker image, broadening hardware options beyond GPUs and potentially improving cost per token for targeted workloads.
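Because the Gaudi backend lives in mainline TGI, clients talk to it the same way as to any other TGI endpoint. The sketch below streams tokens from a local server using huggingface_hub's InferenceClient; the endpoint URL, port, and prompt are placeholders, and it assumes a TGI server (e.g. launched from the Gaudi-enabled Docker image) is already listening there.

```python
# Minimal sketch: stream tokens from a running TGI endpoint.
# Assumes a TGI server is serving at http://localhost:8080;
# the URL and prompt below are placeholders, not part of the release notes.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# stream=True yields tokens as they are generated instead of
# waiting for the full completion (TGI's streaming responses).
for token in client.text_generation(
    "Explain dynamic batching in one sentence.",
    max_new_tokens=64,
    stream=True,
):
    print(token, end="", flush=True)
print()
```

The same client code works unchanged across TGI backends, which is a practical benefit of folding Gaudi support into the mainline codebase rather than a separate fork.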
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info