Prefill/Decode batching optimization for vLLM with Llama-3.1-8B on H100 GPUs
AI Impact Summary
Token generation happens in two stages with different bottlenecks: prefill is compute-bound on the GPU, while decode is memory-bandwidth-bound. Static batching compounds latency because every request in a batch must wait for the longest request to finish, wasting GPU cycles and inflating time-to-first-token. Instead, use continuous (dynamic) batching across concurrent requests and keep prefill and decode in separate queues so the two phases can overlap and keep the GPU utilized; monitor time-to-first-token (TTFT) and time-per-output-token (TPOT) to tune batch sizes and concurrency in vLLM with Llama-3.1-8B on H100 GPUs, as in the sketch below.
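The following is a minimal measurement sketch, assuming a vLLM OpenAI-compatible server is already running locally on port 8000 and serving Llama-3.1-8B; the server flags shown in the comment (`--max-num-seqs`, `--max-num-batched-tokens`, `--enable-chunked-prefill`) are illustrative starting points and may differ across vLLM versions. The client streams one request and records TTFT and an approximate TPOT; in practice you would issue many concurrent requests and aggregate percentiles while sweeping the batching knobs.

```python
# Assumed server launch (continuous batching is vLLM's default scheduler;
# chunked prefill lets prefill chunks interleave with decode steps):
#
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#       --max-num-seqs 256 \
#       --max-num-batched-tokens 8192 \
#       --enable-chunked-prefill
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
chunk_count = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed served model name
    messages=[{"role": "user", "content": "Explain continuous batching briefly."}],
    stream=True,
    max_tokens=256,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        # Each streamed chunk is roughly one token; good enough for tuning.
        chunk_count += 1
        if first_token_time is None:
            first_token_time = time.perf_counter()

end = time.perf_counter()
ttft = first_token_time - start                            # time-to-first-token
tpot = (end - first_token_time) / max(chunk_count - 1, 1)  # per-token decode latency

print(f"TTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms/token")
```

Rising TTFT under load usually means prefill work is queuing behind decode (lower `--max-num-batched-tokens` or enable chunked prefill), while rising TPOT suggests the decode batch is too large for the memory bandwidth available.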
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info