Whisper on Hugging Face Inference Endpoints achieves up to 8x speedup using vLLM on Ada GPUs
AI Impact Summary
The release enables a new Whisper endpoint on Hugging Face Inference Endpoints powered by the vLLM stack, targeting Ada Lovelace GPUs (e.g., L4, L40S) for significant throughput gains. The stack employs PyTorch torch.compile, CUDA graphs, and aggressive KV-cache quantization (bf16, float8) to boost inference performance while maintaining comparable Word Error Rates across Whisper Large V3, Large V3-Turbo, and Distil-Whisper variants. For engineering teams, this implies re-architecting ASR deployments to run on HF Endpoints with these Whisper variants, validating latency and accuracy on production data, and confirming that Ada-capable GPUs are available in order to realize the up-to-8x speedup.
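As a starting point for the deployment validation described above, the sketch below assembles an HTTP transcription request against such an endpoint. This is a minimal sketch, not the release's documented client: the endpoint URL is a placeholder you must replace with your own, and the assumption (common to Hugging Face Inference Endpoints) is that the service accepts raw audio bytes with a bearer token.

```python
def build_transcription_request(endpoint_url, token, audio_bytes,
                                content_type="audio/flac"):
    """Assemble the pieces of a Whisper transcription HTTP call.

    Assumes the generic HF Inference Endpoints pattern (bearer auth,
    raw audio body); verify header and payload details against your
    endpoint's own documentation before relying on this.
    """
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": content_type,
    }
    return {"url": endpoint_url, "headers": headers, "data": audio_bytes}


# Usage with a hypothetical endpoint URL and token:
req = build_transcription_request(
    "https://<your-endpoint>.endpoints.huggingface.cloud",  # placeholder
    "hf_xxx",                                               # placeholder
    b"...raw audio bytes...",
)
# A real call would then be, e.g.:
#   resp = requests.post(req["url"], headers=req["headers"], data=req["data"])
#   print(resp.json()["text"])
```

Keeping request construction separate from the network call makes it easy to unit-test the latency/accuracy validation harness without hitting the endpoint.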
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info