Hugging Face Inference Endpoints delivers 8x faster Whisper via vLLM on Ada Lovelace GPUs
AI Impact Summary
OpenAI Whisper is now available as a faster deployment option on Hugging Face Inference Endpoints, powered by a vLLM backend tuned for throughput. The release combines PyTorch torch.compile, CUDA graphs, and reduced-precision KV caching to deliver up to ~8x higher RTFx (inverse real-time factor) on Whisper Large V3 variants, while maintaining word error rate (WER) comparable to Transformers baselines across standard ASR datasets. This makes high-volume, real-time transcription workloads more cost-efficient and supports community-driven deployments via the Hugging Face Hub and Inference Endpoints.
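As a rough illustration of how such a deployment is typically consumed, the sketch below POSTs raw audio bytes to a deployed endpoint over HTTP. The endpoint URL, token, content type, and response schema are all assumptions (placeholders), not details from the release; the actual request format depends on the deployed handler.

```python
# Hypothetical sketch: transcribing audio against a deployed Whisper
# Inference Endpoint over HTTP. ENDPOINT_URL and HF_TOKEN are
# placeholders; the request/response schema may differ per deployment.
import json
import urllib.request

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = "hf_..."  # Hugging Face access token (placeholder)


def transcribe(audio_path: str) -> str:
    """POST raw audio bytes to the endpoint and return the transcript text."""
    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    request = urllib.request.Request(
        ENDPOINT_URL,
        data=audio_bytes,
        headers={
            "Authorization": f"Bearer {HF_TOKEN}",
            "Content-Type": "audio/flac",  # assumed input format
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.loads(response.read().decode("utf-8"))
    # Assumes the handler returns JSON with a "text" field, as the
    # standard automatic-speech-recognition task does.
    return payload["text"]
```

In practice, the `huggingface_hub` client library offers higher-level helpers for calling Endpoints; the raw-HTTP form above is shown only to make the request shape explicit.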
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info