Whisper-based ASR with diarization and speculative decoding via Hugging Face Inference Endpoints
AI Impact Summary
This change introduces a custom inference handler that combines Whisper ASR, a Pyannote diarization model, and optional speculative decoding in a single Hugging Face Inference Endpoint. The solution is split into modular files (handler.py, diarization_utils.py, config.py) and uses ModelSettings and InferenceConfig to select models and parameters at runtime through environment variables (HF_MODEL_DIR, DIARIZATION_MODEL, HF_TOKEN, ASR_MODEL, ASSISTANT_MODEL). It relies on PyTorch 2.2 with Flash Attention 2 and notes that speedups depend on audio length and batch size: speculative decoding offers gains for short clips but may be neutral or even slower for longer inputs. Operationally, deployment requires managing an access token for the diarization model and provisioning the endpoint programmatically so that secrets can be supplied. This adds integration complexity but yields a single endpoint that returns diarized transcriptions.
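To make the environment-driven configuration concrete, here is a minimal sketch of how a settings object might read those variables. It is not the handler's actual config.py: the field names, the `from_env` helper, the validation rule that a diarization model needs a token, and the example model IDs are all assumptions for illustration, and it uses only the standard library rather than whatever settings library the real code relies on.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ModelSettings:
    """Hypothetical stand-in for the ModelSettings class named in the summary."""

    asr_model: str
    assistant_model: Optional[str] = None    # enables speculative decoding when set
    diarization_model: Optional[str] = None  # enables speaker diarization when set
    hf_token: Optional[str] = None           # access token for the diarization model

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ModelSettings":
        """Build settings from environment variables (or any mapping, for tests)."""
        asr_model = env.get("ASR_MODEL")
        if not asr_model:
            raise ValueError("ASR_MODEL must be set")
        settings = cls(
            asr_model=asr_model,
            # Treat empty strings the same as unset variables.
            assistant_model=env.get("ASSISTANT_MODEL") or None,
            diarization_model=env.get("DIARIZATION_MODEL") or None,
            hf_token=env.get("HF_TOKEN") or None,
        )
        # Assumed rule: fail early if diarization is requested without a token.
        if settings.diarization_model and not settings.hf_token:
            raise ValueError("DIARIZATION_MODEL requires HF_TOKEN")
        return settings


if __name__ == "__main__":
    # Example values only; substitute the models the endpoint should load.
    demo_env = {
        "ASR_MODEL": "openai/whisper-large-v3",
        "DIARIZATION_MODEL": "pyannote/speaker-diarization-3.1",
        "HF_TOKEN": "hf_placeholder",
    }
    print(ModelSettings.from_env(demo_env).asr_model)
```

Leaving ASSISTANT_MODEL or DIARIZATION_MODEL unset simply turns that feature off in this sketch, which matches the summary's description of both as optional components.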
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info