Efficient Request Queueing in LLM-Server: Per-User Queues with vLLM Backpressure
AI Impact Summary
The article describes implementing per-user queues and a non-FIFO, round-robin scheduler in an LLM-Server to achieve fair scheduling when multiple clients contend for a shared GPU-backed backend such as vLLM or HuggingFace TGI. It proposes using Prometheus metrics from the vLLM /metrics endpoint to steer backpressure by keeping the backend queue length below a target (e.g., three), thereby reducing latency for new requests while preserving batch efficiency. It also explores extensions like multi-queue priorities, KV-cache-aware routing, and the potential use of vLLM’s priority scheduling, highlighting trade-offs among latency, throughput, and operational complexity. Operators will need instrumentation, threshold tuning, and ongoing monitoring to balance interactive latency against overall utilization across heterogeneous workloads.
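Below is a minimal sketch of the pattern the summary describes, assuming a Python-based LLM-Server: per-user FIFO queues drained in round-robin order, with a dispatch loop that holds requests whenever the vLLM waiting-queue gauge (`vllm:num_requests_waiting` from the `/metrics` endpoint) is at or above the target depth of three. The endpoint URL, the metric parsing, and the `FairScheduler` / `dispatch_loop` names are illustrative assumptions, not the article's actual implementation.

```python
import collections
import threading
import time

import requests  # assumption: LLM-Server can reach the vLLM /metrics endpoint over HTTP

# Hypothetical values for illustration; the article does not specify these.
VLLM_METRICS_URL = "http://vllm-backend:8000/metrics"
WAITING_METRIC = "vllm:num_requests_waiting"   # Prometheus gauge exposed by vLLM
TARGET_BACKEND_QUEUE = 3                       # keep the backend queue below this target


def backend_waiting_requests() -> float:
    """Scrape the vLLM Prometheus endpoint and return the waiting-queue depth."""
    text = requests.get(VLLM_METRICS_URL, timeout=2).text
    for line in text.splitlines():
        if line.startswith(WAITING_METRIC):
            return float(line.rsplit(" ", 1)[-1])
    return 0.0


class FairScheduler:
    """Per-user FIFO queues drained in round-robin order."""

    def __init__(self):
        self._queues: dict[str, collections.deque] = {}
        self._order: collections.deque = collections.deque()  # round-robin order of users
        self._lock = threading.Lock()

    def enqueue(self, user_id: str, request) -> None:
        with self._lock:
            if user_id not in self._queues:
                self._queues[user_id] = collections.deque()
                self._order.append(user_id)
            self._queues[user_id].append(request)

    def next_request(self):
        """Return (user_id, request) in round-robin order, or None if all queues are empty."""
        with self._lock:
            for _ in range(len(self._order)):
                user_id = self._order[0]
                self._order.rotate(-1)  # move this user to the back of the rotation
                queue = self._queues[user_id]
                if queue:
                    return user_id, queue.popleft()
        return None


def dispatch_loop(scheduler: FairScheduler, send_to_backend) -> None:
    """Forward requests only while the backend queue is below the target depth."""
    while True:
        if backend_waiting_requests() >= TARGET_BACKEND_QUEUE:
            time.sleep(0.05)        # backpressure: hold requests in per-user queues
            continue
        item = scheduler.next_request()
        if item is None:
            time.sleep(0.01)        # nothing pending; poll again shortly
            continue
        user_id, request = item
        send_to_backend(user_id, request)
```

Because the gate is on the backend's waiting-queue depth rather than on a fixed concurrency limit, the backend still receives enough work to batch efficiently, while fairness is enforced upstream in the per-user queues.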
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info