vLLM prefill/decode contention: parallel prefill with limits reduces time-to-first-token under high load
AI Impact Summary
In a self-hosted LLM fleet on 24 H100 GPUs serving 50+ applications, the prefill phase must process the entire prompt and can saturate GPUs, creating a bottleneck when many requests arrive concurrently. Because prefill chunks for different requests are scheduled sequentially, a single long prompt can block the queue and delay the first token of subsequent requests. A proposed fix introduces request-parallel prefills with limits (e.g., batch four prefills, but at most one longer than 10k tokens) to reduce time-to-first-token for short requests, though decode latency remains affected while prefills run concurrently. A full mitigation likely requires architectural changes, such as separate inference servers for long prompts or disaggregated prefill and decode engines, which would demand additional GPU resources and a redesigned scheduler.
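As a rough illustration of the proposed limits, the sketch below shows an admission policy that fills each scheduling step with up to four prefills while admitting at most one request whose prompt exceeds 10k tokens. The constants, the Request dataclass, and select_prefill_batch are hypothetical names chosen for this sketch; they are not vLLM's actual scheduler interface.

```python
# Minimal sketch of the proposed limited-parallel-prefill policy.
# All names and thresholds are illustrative, not vLLM internals.
from collections import deque
from dataclasses import dataclass

MAX_PARALLEL_PREFILLS = 4      # prefills batched per scheduling step (assumed)
LONG_PROMPT_TOKENS = 10_000    # threshold for a "long" prompt (assumed)

@dataclass
class Request:
    request_id: str
    prompt_tokens: int

def select_prefill_batch(waiting: deque[Request]) -> list[Request]:
    """Pick the next prefill batch from the FIFO waiting queue.

    Short requests fill the batch first-come-first-served; a long request
    is admitted only if no other long request is already in the batch,
    so a single large prompt cannot monopolize the step.
    """
    batch: list[Request] = []
    skipped: list[Request] = []
    long_admitted = False

    while waiting and len(batch) < MAX_PARALLEL_PREFILLS:
        req = waiting.popleft()
        if req.prompt_tokens > LONG_PROMPT_TOKENS:
            if long_admitted:
                skipped.append(req)   # defer extra long prompts to a later step
                continue
            long_admitted = True
        batch.append(req)

    # Put deferred long prompts back at the front, preserving arrival order.
    for req in reversed(skipped):
        waiting.appendleft(req)
    return batch

if __name__ == "__main__":
    q = deque([
        Request("a", 512),
        Request("b", 32_000),   # long prompt
        Request("c", 1_024),
        Request("d", 48_000),   # second long prompt, deferred
        Request("e", 256),
    ])
    print([r.request_id for r in select_prefill_batch(q)])  # ['a', 'b', 'c', 'e']
    print([r.request_id for r in q])                        # ['d'] waits for the next step
```

Under this policy a short request never waits behind more than one long prefill per step, which is the mechanism by which the proposal lowers time-to-first-token for short prompts without eliminating decode-side contention.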
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info