Efficient Request Queuing for LLMs with a vLLM Backend: Per-User Fair Scheduling
AI Impact Summary
This article describes a front-end LLM-Server with per-user queues and a non-FIFO round-robin scheduler that ensures fair access across multiple users and models when deploying vLLM or HuggingFace TGI backends. It emphasizes dynamic backpressure: the front end reads Prometheus metrics from the backend's /metrics endpoint and holds requests upstream to keep the backend queue under a target length (e.g., three), reducing latency for new requests while maintaining throughput. It also outlines extensions such as priority-based queues, cache-aware routing, and vLLM's built-in priority scheduling to further improve latency and utilization for hosted LLM deployments.
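To make the queuing and backpressure mechanics concrete, here is a minimal Python sketch, not the article's actual implementation. It assumes an OpenAI-compatible vLLM server at `BACKEND_URL` whose Prometheus endpoint exposes the waiting-queue gauge `vllm:num_requests_waiting`; the names `enqueue`, `dispatch_loop`, `TARGET_QUEUE_LEN`, and `POLL_INTERVAL_S` are hypothetical, introduced only for illustration.

```python
# Minimal sketch of per-user fair scheduling with metrics-based backpressure.
# Assumptions: an OpenAI-compatible vLLM server at BACKEND_URL, and a
# Prometheus /metrics endpoint exposing the gauge "vllm:num_requests_waiting".
import asyncio
import collections

import httpx  # third-party async HTTP client, used here for illustration

BACKEND_URL = "http://localhost:8000"   # hypothetical vLLM server address
TARGET_QUEUE_LEN = 3                    # backpressure target from the article
POLL_INTERVAL_S = 0.2                   # how often to re-check the metric

user_queues: dict[str, collections.deque] = {}  # user_id -> pending requests


def enqueue(user_id: str, request: dict) -> None:
    """Append a request to the per-user queue, creating it on first use."""
    user_queues.setdefault(user_id, collections.deque()).append(request)


async def backend_waiting(client: httpx.AsyncClient) -> int:
    """Scrape /metrics and return the backend's waiting-queue length."""
    text = (await client.get(f"{BACKEND_URL}/metrics")).text
    for line in text.splitlines():
        if line.startswith("vllm:num_requests_waiting"):
            return int(float(line.rsplit(" ", 1)[-1]))
    return 0  # metric absent: assume no backlog


async def dispatch_loop() -> None:
    """Round-robin over users, forwarding at most one request per user per
    pass (non-FIFO), but only while the backend queue is below the target."""
    async with httpx.AsyncClient() as client:
        while True:
            # Dynamic backpressure: hold requests upstream until the
            # backend's waiting queue drops below the target length.
            while await backend_waiting(client) >= TARGET_QUEUE_LEN:
                await asyncio.sleep(POLL_INTERVAL_S)
            forwarded = False
            for user_id in list(user_queues):  # one turn per user
                queue = user_queues[user_id]
                if queue:
                    body = queue.popleft()
                    forwarded = True
                    # Fire-and-forget for brevity; real code would keep a
                    # reference to the task and relay the response.
                    asyncio.create_task(
                        client.post(f"{BACKEND_URL}/v1/completions", json=body)
                    )
            if not forwarded:
                await asyncio.sleep(POLL_INTERVAL_S)  # all queues empty
```

The key design point this sketch tries to capture: requests are held in per-user deques upstream of the backend, which is what allows the scheduler to reorder across users; once a request enters the backend's own queue, its position is fixed, so keeping that queue short preserves the front end's control over fairness and latency.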
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info