vLLM prefill/decode contention: parallel prefill with limits reduces time-to-first-token under high load
AI Impact Summary
In a self-hosted LLM fleet on 24 H100 GPUs serving 50+ applications, the prefill phase must process the entire prompt and can saturate GPUs, creating a bottleneck when many requests arrive concurrently. Because prefill chunks for different requests are scheduled sequentially, a single long prompt can block the queue and delay the first token of subsequent requests. A proposed fix introduces request-parallel prefills with limits (e.g., batch four prefills, but at most one longer than 10k tokens) to reduce time-to-first-token for short requests, though decode latency remains affected while prefills run concurrently. A full mitigation likely requires architectural changes, such as separate inference servers for long prompts or disaggregated prefill and decode engines, which would demand additional GPU resources and a redesigned scheduler.
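As a rough illustration of the proposed limits, the sketch below shows an admission policy that fills each scheduling step with up to four prefills while admitting at most one request whose prompt exceeds 10k tokens. The constants, the Request dataclass, and select_prefill_batch are hypothetical names chosen for this sketch; they are not vLLM's actual scheduler interface.

```python
# Minimal sketch of the proposed limited-parallel-prefill policy.
# All names and thresholds are illustrative, not vLLM internals.
from collections import deque
from dataclasses import dataclass

MAX_PARALLEL_PREFILLS = 4      # prefills batched per scheduling step (assumed)
LONG_PROMPT_TOKENS = 10_000    # threshold for a "long" prompt (assumed)

@dataclass
class Request:
    request_id: str
    prompt_tokens: int

def select_prefill_batch(waiting: deque[Request]) -> list[Request]:
    """Pick the next prefill batch from the FIFO waiting queue.

    Short requests fill the batch first-come-first-served; a long request
    is admitted only if no other long request is already in the batch,
    so a single large prompt cannot monopolize the step.
    """
    batch: list[Request] = []
    skipped: list[Request] = []
    long_admitted = False

    while waiting and len(batch) < MAX_PARALLEL_PREFILLS:
        req = waiting.popleft()
        if req.prompt_tokens > LONG_PROMPT_TOKENS:
            if long_admitted:
                skipped.append(req)   # defer extra long prompts to a later step
                continue
            long_admitted = True
        batch.append(req)

    # Put deferred long prompts back at the front, preserving arrival order.
    for req in reversed(skipped):
        waiting.appendleft(req)
    return batch

if __name__ == "__main__":
    q = deque([
        Request("a", 512),
        Request("b", 32_000),   # long prompt
        Request("c", 1_024),
        Request("d", 48_000),   # second long prompt, deferred
        Request("e", 256),
    ])
    print([r.request_id for r in select_prefill_batch(q)])  # ['a', 'b', 'c', 'e']
    print([r.request_id for r in q])                        # ['d'] waits for the next step
```

Under this policy a short request never waits behind more than one long prefill per step, which is the mechanism by which the proposal lowers time-to-first-token for short prompts without eliminating decode-side contention.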
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info