Efficient Request Queueing in LLM-Server: Per-User Queues with vLLM Backpressure
AI Impact Summary
The article describes implementing per-user queues and a non-FIFO, round-robin scheduler in an LLM-Server to achieve fair scheduling when multiple clients contend for a shared GPU-backed backend such as vLLM or HuggingFace TGI. It proposes using Prometheus metrics from the vLLM /metrics endpoint to steer backpressure by keeping the backend queue length below a target (e.g., three), thereby reducing latency for new requests while preserving batch efficiency. It also explores extensions like multi-queue priorities, KV-cache-aware routing, and the potential use of vLLM’s priority scheduling, highlighting trade-offs among latency, throughput, and operational complexity. Operators will need instrumentation, threshold tuning, and ongoing monitoring to balance interactive latency against overall utilization across heterogeneous workloads.
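Below is a minimal sketch of the pattern the summary describes, assuming a Python-based LLM-Server: per-user FIFO queues drained in round-robin order, with a dispatch loop that holds requests whenever the vLLM waiting-queue gauge (`vllm:num_requests_waiting` from the `/metrics` endpoint) is at or above the target depth of three. The endpoint URL, the metric parsing, and the `FairScheduler` / `dispatch_loop` names are illustrative assumptions, not the article's actual implementation.

```python
import collections
import threading
import time

import requests  # assumption: LLM-Server can reach the vLLM /metrics endpoint over HTTP

# Hypothetical values for illustration; the article does not specify these.
VLLM_METRICS_URL = "http://vllm-backend:8000/metrics"
WAITING_METRIC = "vllm:num_requests_waiting"   # Prometheus gauge exposed by vLLM
TARGET_BACKEND_QUEUE = 3                       # keep the backend queue below this target


def backend_waiting_requests() -> float:
    """Scrape the vLLM Prometheus endpoint and return the waiting-queue depth."""
    text = requests.get(VLLM_METRICS_URL, timeout=2).text
    for line in text.splitlines():
        if line.startswith(WAITING_METRIC):
            return float(line.rsplit(" ", 1)[-1])
    return 0.0


class FairScheduler:
    """Per-user FIFO queues drained in round-robin order."""

    def __init__(self):
        self._queues: dict[str, collections.deque] = {}
        self._order: collections.deque = collections.deque()  # round-robin order of users
        self._lock = threading.Lock()

    def enqueue(self, user_id: str, request) -> None:
        with self._lock:
            if user_id not in self._queues:
                self._queues[user_id] = collections.deque()
                self._order.append(user_id)
            self._queues[user_id].append(request)

    def next_request(self):
        """Return (user_id, request) in round-robin order, or None if all queues are empty."""
        with self._lock:
            for _ in range(len(self._order)):
                user_id = self._order[0]
                self._order.rotate(-1)  # move this user to the back of the rotation
                queue = self._queues[user_id]
                if queue:
                    return user_id, queue.popleft()
        return None


def dispatch_loop(scheduler: FairScheduler, send_to_backend) -> None:
    """Forward requests only while the backend queue is below the target depth."""
    while True:
        if backend_waiting_requests() >= TARGET_BACKEND_QUEUE:
            time.sleep(0.05)        # backpressure: hold requests in per-user queues
            continue
        item = scheduler.next_request()
        if item is None:
            time.sleep(0.01)        # nothing pending; poll again shortly
            continue
        user_id, request = item
        send_to_backend(user_id, request)
```

Because the gate is on the backend's waiting-queue depth rather than on a fixed concurrency limit, the backend still receives enough work to batch efficiently, while fairness is enforced upstream in the per-user queues.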
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info