Efficient Request Queuing for LLMs with a vLLM Backend: Per-User Fair Scheduling
AI Impact Summary
This article describes a front-end LLM-Server with per-user queues and a non-FIFO round-robin scheduler that ensures fair access across multiple users and models when deploying vLLM or HuggingFace TGI backends. It emphasizes dynamic backpressure: the front end reads Prometheus metrics from the backend's /metrics endpoint and holds requests upstream to keep the backend queue under a target length (e.g., three), reducing latency for new requests while maintaining throughput. It also outlines extensions such as priority-based queues, cache-aware routing, and vLLM's built-in priority scheduling to further improve latency and utilization for hosted LLM deployments.
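To make the queuing and backpressure mechanics concrete, here is a minimal Python sketch, not the article's actual implementation. It assumes an OpenAI-compatible vLLM server at `BACKEND_URL` whose Prometheus endpoint exposes the waiting-queue gauge `vllm:num_requests_waiting`; the names `enqueue`, `dispatch_loop`, `TARGET_QUEUE_LEN`, and `POLL_INTERVAL_S` are hypothetical, introduced only for illustration.

```python
# Minimal sketch of per-user fair scheduling with metrics-based backpressure.
# Assumptions: an OpenAI-compatible vLLM server at BACKEND_URL, and a
# Prometheus /metrics endpoint exposing the gauge "vllm:num_requests_waiting".
import asyncio
import collections

import httpx  # third-party async HTTP client, used here for illustration

BACKEND_URL = "http://localhost:8000"   # hypothetical vLLM server address
TARGET_QUEUE_LEN = 3                    # backpressure target from the article
POLL_INTERVAL_S = 0.2                   # how often to re-check the metric

user_queues: dict[str, collections.deque] = {}  # user_id -> pending requests


def enqueue(user_id: str, request: dict) -> None:
    """Append a request to the per-user queue, creating it on first use."""
    user_queues.setdefault(user_id, collections.deque()).append(request)


async def backend_waiting(client: httpx.AsyncClient) -> int:
    """Scrape /metrics and return the backend's waiting-queue length."""
    text = (await client.get(f"{BACKEND_URL}/metrics")).text
    for line in text.splitlines():
        if line.startswith("vllm:num_requests_waiting"):
            return int(float(line.rsplit(" ", 1)[-1]))
    return 0  # metric absent: assume no backlog


async def dispatch_loop() -> None:
    """Round-robin over users, forwarding at most one request per user per
    pass (non-FIFO), but only while the backend queue is below the target."""
    async with httpx.AsyncClient() as client:
        while True:
            # Dynamic backpressure: hold requests upstream until the
            # backend's waiting queue drops below the target length.
            while await backend_waiting(client) >= TARGET_QUEUE_LEN:
                await asyncio.sleep(POLL_INTERVAL_S)
            forwarded = False
            for user_id in list(user_queues):  # one turn per user
                queue = user_queues[user_id]
                if queue:
                    body = queue.popleft()
                    forwarded = True
                    # Fire-and-forget for brevity; real code would keep a
                    # reference to the task and relay the response.
                    asyncio.create_task(
                        client.post(f"{BACKEND_URL}/v1/completions", json=body)
                    )
            if not forwarded:
                await asyncio.sleep(POLL_INTERVAL_S)  # all queues empty
```

The key design point this sketch tries to capture: requests are held in per-user deques upstream of the backend, which is what allows the scheduler to reorder across users; once a request enters the backend's own queue, its position is fixed, so keeping that queue short preserves the front end's control over fairness and latency.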
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info