
vLLM Serve Optimizations

Notes on how to optimize vLLM for memory requirements, inference speed, etc.

Relevant parameters

  • max-model-len: Sets the model context length. Affects the size of the KV cache (vLLM ensures the size of the preallocated KV cache is at least 1 x max-model-len), hence affecting vRAM requirements
  • gpu_memory_utilization: Sets the fraction of each GPU's memory that vLLM will use for model weights, activations, and KV cache. Default=0.9, meaning 90% of each GPU's vRAM is available for these purposes, and the remaining 10% is reserved for overhead (CUDA graphs, fragmentation, etc.)
  • max_num_seqs: The number of requests the server can process simultaneously. Directly affects how many concurrent users the server can handle
  • max_num_batched_tokens: The total length (number of tokens) of all sequences in a batch. Lower values = less KV cache needed, faster inter-token latency (ITL). Higher values = faster time to first token (TTFT)
  • tensor_parallel_size: How many GPUs share each layer's computation by splitting tensors across them. TP=4 means each layer's tensor is split into 4 parts & placed onto 4 GPUs.
  • pipeline_parallel_size: How many GPUs split the model layers into stages. Used when model weights are too large to fit onto one GPU even with TP.
  • enable-expert-parallel: For MoE models; Each GPU holds a different subset of complete experts. Uses the parallelism degree specified by TP_size x DP_size
  • enforce-eager: Whether to use eager mode, or a combination of eager & graph capture mode. If set to True, less VRAM is consumed, but decoding speed is slower. If set to False, more VRAM is consumed, first few requests are slower due to graph compilations, but GPUs are more efficiently used
  • dtype: Can downcast the checkpoints & activations to lower precision dtypes, thereby reducing VRAM requirements
  • kv-cache-dtype: Same as dtype but for KV cache
  • cpu-offload-gb: The amount of weights (in GB) to offload to the CPU, per GPU. Effectively increases our vRAM so we can hold larger models, but adds latency due to CPU-GPU communication
  • swap-space: CPU swap space for KV cache. Allow for larger KV cache
  • block-size: Sets the block size for PagedAttention. Higher values = less overhead, more fragmentation, more VRAM wasted. Lower values = more efficient use of VRAM, more overhead managing more blocks
  • mm_processor_cache_gb: Set the size of the multi-modal cache. Higher values = less repeated preprocessing, reduced latency, more CPU RAM usage. Lower values = higher latency, less CPU RAM usage
  • speculative-model: Configures the draft model used for speculative decoding, which can increase the decoding speed

TLDR

  • Reduce memory requirement/Increase available memory: -max-model-len, +gpu_memory_utilization, -max_num_batched_tokens, +tensor-parallel-size, +pipeline-parallel-size, +enable-expert-parallel, +enforce-eager, -dtype, -kv-cache-dtype, +cpu-offload-gb, +swap-space, -block-size
  • Speed up inference: +gpu_memory_utilization, -max-num-batched-tokens, -enforce-eager, -cpu-offload-gb, +block-size, +mm_processor_cache_gb, +speculative-model

Max model length

Setting max-model-len sets the model context length, which is the maximum total number of tokens (input + output) that vLLM supports for the model.

At startup, vLLM pre-allocates a big pool of KV cache blocks on the GPU. vLLM ensures this capacity is at least enough to serve one sequence up to max-model-len - if not, we'll get the "KV cache is not enough for the model's max seq len..." error.

E.g. assume we only ever need 4k-token sequences. By setting max-model-len=8192, we're asking vLLM to reserve KV capacity for an 8k-token sequence - effectively wasting KV cache space.

KV cache memory is roughly proportional to num_layers x hidden_size x max_model_len x batch_size, so lowering max-model-len leads to less KV cache per sequence, and less KV cache memory needed.
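That proportionality can be sketched with back-of-envelope numbers. The model dimensions below are illustrative (not any specific checkpoint), and this assumes an fp16/bf16 KV cache:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Rough KV cache footprint: 2x for the K and V tensors;
    bytes_per_elem=2 assumes an fp16/bf16 cache."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Halving max-model-len halves the per-sequence KV footprint
full = kv_cache_bytes(32, 8, 128, 8192, 1)   # 1.0 GiB
half = kv_cache_bytes(32, 8, 128, 4096, 1)   # 0.5 GiB
```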

Data type

Setting dtype configures the data type for model weights & activations. Setting kv-cache-dtype configures the data type for KV cache storage.

dtype casts weights & activations to a different floating-point format. We can have an fp32 model, but by setting dtype=fp16 we will convert each 32-bit float to a 16-bit float. We're just doing a straightforward type cast here - reducing the precision of our float.

E.g.

  • FP32: 0.1234567890123456 (32 bits)
  • FP16: 0.1235 (16 bits)

This is different from quantization! Quantization also reduces the precision of model parameters, but unlike dtype, it does not simply reinterpret floats in a smaller float format. Instead, quantization applies model-aware math transformations that preserve accuracy as much as possible

  • Per-tensor/per-channel/per-group scaling factors are computed during the quantization phase (for Post-Training Quantization, the most common approach), which are used to reconstruct the approximate float at inference using dequantized_value = quantized_integer * scaling_factor
  • Outlier handling may route outliers into a separate higher-precision path
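The contrast can be sketched in a few lines. The first snippet is a plain dtype downcast; the second is a minimal per-tensor PTQ sketch (the weight values are made up for illustration):

```python
import struct

# Plain dtype downcast: the same float, stored at lower precision (fp32 -> fp16)
x = 0.1234567890123456
fp16_x = struct.unpack('e', struct.pack('e', x))[0]  # 'e' = 16-bit float
# fp16_x is ~0.1235: precision is lost, no model-aware math involved

# Per-tensor int8 quantization sketch: a scaling factor is computed once
# (quantization phase), then used to reconstruct approximate floats at inference
weights = [0.5, -1.2, 3.4, -0.7]
scale = max(abs(w) for w in weights) / 127   # per-tensor scaling factor
q = [round(w / scale) for w in weights]      # stored as small integers (int8 range)
deq = [qi * scale for qi in q]               # dequantized_value = quantized_integer * scaling_factor
```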

Quantization

Setting quantization chooses the weight quantization scheme. Used when we're VRAM bound.

There are two major types of quantization

  1. Offline quantization: We start with the original model. A quantization tool processes it offline, writing out entirely new model files where the weights are stored in the quantized (compressed) format. We load these quantized files directly in vLLM; no quantization happens at runtime.
  2. On-the-fly quantization: We load the model's original weights. During vLLM's model initialization, certain layers are converted & wrapped into quantized modules. Quantization happens in memory only, not saved to disk.

If set to None, vLLM checks the quantization_config attribute in the model config file. If there is quantization config, vLLM assumes the model is already quantized and will ignore dtype entirely for weight precision. dtype will still apply to activations, but not weights.

If the model isn't quantized, but we still set --quantization, then vLLM will try to apply runtime quantization using the specified method if the backend supports on-the-fly quantization. However, not all quantization types in vLLM perform on-the-fly quantization. Many quantization formats like GPTQ, AWQ, Marlin require pretrained quantized weights. bitsandbytes supports on-the-fly quantization, but its performance is lower compared to AWQ/GPTQ.

Offload to CPU

Setting cpu-offload-gb sets the amount of weights to offload to CPU per GPU. Helps us "virtually enlarge" GPU memory (e.g. 24GB VRAM + --cpu-offload-gb=10 gives 34GB of effective memory for the model).

vLLM will keep part of the model weights in CPU RAM and dynamically move them to the GPU as needed. Latency is increased as there are frequent CPU-GPU transfers

Swap Space

Setting swap-space sets the CPU swap space for KV cache per GPU

vLLM's paged attention stores KV cache blocks on the GPU. When GPU memory is full, old/less-used KV pages can be swapped out to this CPU swap space.

With a larger swap-space, we can support more concurrent requests, and longer sequence lengths without getting out of memory (OOM) issues. However, latency is increased as accessing KV cache from CPU is slow.

PagedAttention Block Size

Setting block-size sets the KV cache block size for PagedAttention

In PagedAttention, rather than holding the entire KV cache as one big contiguous buffer, vLLM splits it into many small blocks (pages). Each block holds the KV data for a contiguous number of tokens.

E.g block-size=16 means each block holds data for 16 tokens

One block can only be used by one sequence. It is not shared or mixed across sequences - hence, fragmentation exists. E.g. if block-size=16 and a sequence uses 17 tokens, we'd need 2 blocks, and the 2nd block will have 15 unused token slots - this partially filled block is internally fragmented. This leads to wasted VRAM, as we've allocated memory for a block that isn't fully used.

Changing the block-size affects memory management, GPU memory utilization... there are tradeoffs

  • Larger block-sizes reduce block-table overhead (i.e. to keep track of which blocks belong to which sequence and at which position) as there are fewer blocks to manage. However, this leads to more internal fragmentation, meaning more VRAM is wasted
  • Smaller block-sizes reduce wasted memory & allow more efficient memory use - since allocation is more fine-grained, there's a higher chance that all tokens in a block are used. However, there is increased overhead for managing more blocks
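The fragmentation arithmetic can be sketched directly (a toy calculation, not vLLM's actual allocator):

```python
import math

def wasted_slots(seq_len, block_size):
    """Unused token slots in the last, partially filled block."""
    num_blocks = math.ceil(seq_len / block_size)
    return num_blocks * block_size - seq_len

# A 17-token sequence with block-size=16 needs 2 blocks and wastes 15 slots;
# smaller blocks waste fewer slots, at the cost of more blocks to manage
assert wasted_slots(17, 16) == 15
assert wasted_slots(17, 8) == 7
```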

Multi-modal Cache Size

Setting mm_processor_cache_gb sets the size of the multi-modal cache. Default=4

This cache is used to store processed multi-modal inputs (images, video frames, audio, etc.), so if the same input is seen again, vLLM can skip re-processing and reuse the cached processed data

When we do multi-turn interactions, the cache avoids redundant preprocessing, thereby saving CPU/GPU compute & reducing latency. However, a larger cache means more RAM is consumed.

Preemption

vLLM adopts a first-come-first-serve scheduling policy by default. When the KV cache space is insufficient to handle all requests, it preempts (interrupts & reschedules) the latest requests - ensuring the earlier requests are served first.

Some ways to resolve KV cache issues:

  • Increase gpu_memory_utilization: This increases the amount of GPU memory that vLLM is allowed to use, thereby increasing the memory allocated for KV cache
  • Decrease max_num_seqs or max_num_batched_tokens: By reducing the no. of concurrent requests in a batch, less KV cache is needed
  • Increase tensor_parallel_size or pipeline_parallel_size: Each GPU stores fewer model weights, and has more memory available for KV cache

Tensor Parallelism

Splits the weight matrices in individual layers across multiple GPUs. Each GPU holds all layers, but only a subset of the tensors in those layers. Adds communication overhead, as GPUs must exchange & combine partial results on each forward pass - so high-speed interconnects between GPUs are needed

Use TP when the model is too large to fit on a single GPU, or to reduce memory pressure per GPU to allow more KV cache space for higher throughput (requests per second; RPS, and tokens per second; TPS)

E.g. With TP=4, for all layers:

  • GPU0: 1st slice
  • GPU1: 2nd slice
  • GPU2: 3rd slice
  • GPU3: 4th slice
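A toy sketch of the idea in pure Python (no real GPUs; a 4x4 matrix stands in for one layer's weight): splitting the weight row-wise across four "ranks" and gathering the partial outputs reproduces the full matrix-vector product.

```python
def matvec(W, x):
    """Dense matrix-vector product: one output value per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 0, 2, 1]

full = matvec(W, x)  # single-GPU result

# TP=4: each rank holds one row slice and computes its output shard locally;
# an all-gather (here, flattening the shards) reassembles the full output
shards = [matvec([row], x) for row in W]
gathered = [v for shard in shards for v in shard]
assert gathered == full
```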

Pipeline Parallelism

Partitions the model's layers across different GPUs. This reduces the memory requirements per GPU. Less communication overhead as GPUs only communicate once per stage.

However, due to the sequential nature of the pipeline, this can lead to "pipeline bubbles" - some GPUs are idle, waiting for activations from the preceding layer

Use PP when you've already maxed out efficient TP but need to distribute the model even further. Can also be used for deep, narrow models (many layers, but each layer is small) where layer distribution is more efficient than tensor sharding.

E.g. 48-layer transformer with PP=4:

  • GPU0: layers 0–11
  • GPU1: layers 12–23
  • GPU2: layers 24–35
  • GPU3: layers 36–47
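The stage split can be sketched as follows (increments stand in for transformer layers; real pipeline scheduling would also overlap micro-batches to shrink bubbles):

```python
from functools import reduce

layers = [(lambda x: x + 1) for _ in range(48)]        # 48 stand-in "layers"
stages = [layers[i:i + 12] for i in range(0, 48, 12)]  # PP=4: 12 layers per stage

def run_stage(x, stage):
    # One GPU runs its 12 layers, then hands activations to the next stage;
    # there is one hand-off per stage boundary (3 transfers for PP=4)
    for layer in stage:
        x = layer(x)
    return x

out = reduce(run_stage, stages, 0)
assert out == 48   # same result as running all 48 layers on one device
```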

Expert Parallelism

A specialized form of parallelism for MoE models, where different expert networks are distributed across GPUs. Each GPU holds a different subset of complete experts. It uses the degree of parallelism given by TP_size x DP_size. The flag for EP is a boolean (set as True/False), not an int!

EP only takes effect if TP or DP is configured - that is, TP_size x DP_size > 1; otherwise EP is ignored.

Data Parallelism

Replicates the entire model across multiple GPU sets and processes different batches of request in parallel.

Chunked Prefill

A GPU has 2 main resources

  1. Compute (FLOPs) - how many math ops per sec it can do
  2. Memory bandwidth - how quickly it can move data between memory (vRAM) and compute cores

LLM prefill is compute-bound: Each transformer layer must do large matrix multiplications across the whole prompt, so we're limited by the GPU's compute units

LLM decoding is memory-bound: When generating new tokens, the model uses the KV cache. It repeatedly reads a large amount of cached data from memory, but only does a small amount of computation per step. Memory access becomes the bottleneck.

When an LLM receives a prompt, it first runs the prefill step - processing all input tokens and building the KV cache before it begins to generate new tokens. With very long prompts, prefill uses lots of compute and blocks other requests, increasing latency.

Chunked prefill breaks the large prompt into smaller segments (chunks) and processes them incrementally. This lets vLLM interleave prefill chunks from long requests (compute-bound) with decode steps from other requests (memory-bound). By mixing compute-bound and memory-bound operations in the same batch, we achieve better GPU utilization, improving both throughput & latency.

In multi-request serving (vLLM), requests overlap in time. Say we're running a server with many concurrent users: Some users are feeding long prompts (prefill), others are mid-generation (decode). If the system waits for prefills to finish before handling decode, then one long prompt will block every decoding user, causing latency & throughput to spike. So, vLLM must interleave this work - doing prefill for some requests, and decode for others - all within the same scheduling cycle.

vLLM mixes prefill chunks after batching all decode tokens. This way

  • GPU stays near 100% utilization
  • Decode latency stays low
  • Long prefills still make progress

A typical scheduler cycle will look like this

  1. Batch all pending decode steps: This ensures no active generations are stalled
  2. Check remaining max_num_batched_tokens budget
  3. Fill leftover capacity with prefill chunks: If prefill is too large, chunk it dynamically

max_num_batched_tokens is the maximum number of tokens vLLM will process in a single forward pass (cycle). The mixed batch will look like this:

[decode tokens from many users] + [prefill chunk]

This batch will run through the model in a single forward pass.

So, the server doesn't process one prompt at a time (prefill + decode for the same sequence). It interleaves many decode + prefill across multiple prompts in a single pass.

E.g. Given max_num_batched_tokens=4096, we have a 10,000 token prompt, & there are decode steps waiting

  1. Batch all decode tokens first (assume 1,200 tokens of decode were scheduled)
  2. Remaining budget = 4096 - 1200 = 2896 tokens
  3. Prefill wants 10,000 tokens but this is too large. vLLM schedules 2896 prefill tokens
  4. The remaining 7104 tokens are handled in future passes

If decode tokens > max_num_batched_tokens, then the pass contains only decode tokens (up to the limit). No prefill happens that pass, and the leftover decode tokens roll over to the next pass. In theory, this means prefill may be starved forever if many decode requests exist (since decode tokens constantly saturate the token budget & prefill isn't able to run).

However, in practice vLLM avoids prefill starvation because decode steps are very small. Each decode request produces only 1 token per scheduler cycle, since a new decode generation can only happen after generating the previous one. Even with 1000 decode requests, we only allocate 1000 decode tokens per cycle. Prefill is only starved if the number of concurrent decode requests exceeds the batch token limit (i.e. >4096 concurrent decode requests).

So in reality, our example will be more like:

  • max_num_batched_tokens = 4096
  • 10 decode requests = 10 decode tokens per cycle
  • 1 huge prefill = 10,000 tokens

Cycle 1

  • decode: 10
  • remaining budget: 4086
  • prefill chunk: 4086
  • prefill remaining: 5914

Cycle 2

  • decode: 10
  • remaining budget: 4086
  • prefill chunk: 4086
  • prefill remaining: 1828

Cycle 3

  • decode: 10
  • remaining budget: 4086
  • prefill chunk: 1828
  • prefill remaining: 0

Prefill finishes in 3 cycles while decode requests continue uninterrupted.
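The cycles above can be reproduced in a few lines (a heavy simplification of vLLM's scheduler, but the budget arithmetic is the same):

```python
# Scheduler-cycle sketch: decodes are batched first, prefill fills the rest
MAX_BATCHED_TOKENS = 4096
decode_tokens_per_cycle = 10     # 10 decode requests -> 1 token each per cycle
prefill_remaining = 10_000       # one huge prompt

chunks = []
while prefill_remaining > 0:
    budget = MAX_BATCHED_TOKENS - decode_tokens_per_cycle
    chunk = min(budget, prefill_remaining)  # chunk the prefill to fit the budget
    chunks.append(chunk)
    prefill_remaining -= chunk

assert chunks == [4086, 4086, 1828]   # prefill completes in 3 cycles
```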

We can tune performance by adjusting this max_num_batched_tokens parameter

  • Smaller values achieve better inter-token latency (ITL; time it takes between generating token n and n+1 during decoding) because there's fewer prefills slowing down decodes
  • Higher values achieve better time to first token (TTFT; time from sending a request to receiving the first generated token, which includes prefill + scheduling delay + 1st decode step) because we can process more prefill tokens in a batch

Eager Mode

vLLM supports two modes for execution for model forward pass

  • Graph/Captured mode: vLLM tries to capture CUDA graphs once, and replays them for later requests
  • Standard "eager" mode: PyTorch executes operations immediately

If --enforce-eager is set, CUDA graphs are disabled & the model always executes in eager mode. If not set, CUDA graphs and eager execution are used in combination.

Graphs consume additional VRAM. If memory issues are encountered, set --enforce-eager to disable graph capture.

Graph mode is typically much faster, since replaying captured CUDA graphs avoids per-kernel launch overhead. The tradeoff is memory: disabling CUDA graphs reduces VRAM usage, as graph buffers are never allocated.

CUDA graph shapes refer to the tensor dimensions that appear in the forward pass (for prefill/decode). The shape is affected by things like batch size, sequence length, KV cache (affected by decode sequence length; no. of tokens that have been generated so far), Prefill vs Decode mode, etc.

We want to capture CUDA graphs for shape patterns that recur. If a shape appears multiple times (i.e. becomes a stable/frequent shape), there's a high likelihood that future forward passes will reuse the captured graph.

Capturing every unique shape is wasteful, because many shapes - especially those tied to raw sequence lengths - occur only once. If we captured a graph for each of those one-off shapes, we would accumulate a large number of CUDA graphs, paying the graph-capture overhead & consuming unnecessary GPU memory on graphs that will never be replayed.

When --enforce-eager is not set (the default), then

  • On first requests, vLLM runs in eager mode. It then captures CUDA graphs for the forward passes once shapes are stable. There might be some warm-up/compilation overhead - TTFT (time to first token) may be slower for the first few requests
  • After graphs are captured, subsequent requests with compatible shapes reuse the captured graphs. We get higher throughput, lower latency, better GPU utilization.

When --enforce-eager is set, then

  • No graph warm-up overhead. Our very first requests will start normally (no extra capture overhead). However, as every single request runs in eager mode, we get lower peak throughput & higher latency.

Compilation Config for CUDA Graphs

Setting compilation-config controls the torch.compile optimization level when we're using CUDA graphs

  • 0: no extra optimizations
  • 3: More aggressive torch.compile. Higher compilation overhead on startup, but better steady-state latency & throughput

Speculative decoding

Setting speculative-model configures the draft model used for speculative decoding. Setting num-speculative-tokens sets how many tokens to speculate per step. Higher values give more potential speedup, but more wasted work if the rejection rate is high.
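A toy sketch of the accept/reject step for the greedy case (token IDs below are made up; real speculative decoding verifies draft tokens against the target model's distribution, not exact matches only):

```python
def accept_prefix(draft_tokens, target_tokens):
    """Length of the longest prefix where the draft agrees with the target."""
    n = 0
    while n < len(draft_tokens) and draft_tokens[n] == target_tokens[n]:
        n += 1
    return n

draft = [5, 9, 2, 7]     # k = num-speculative-tokens = 4 proposed tokens
target = [5, 9, 4, 1]    # what the target model would pick at each position
accepted = accept_prefix(draft, target)
assert accepted == 2     # 2 draft tokens kept; the remaining draft work is wasted
```

This is why a high num-speculative-tokens only pays off when the draft model's acceptance rate is high: every rejected tail token was compute spent for nothing.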

Baseline

Memory-light configuration

vllm serve \
    MODEL_NAME \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key secret \
    --dtype half \
    --tensor-parallel-size NUM_GPUS \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-seqs 4 \
    --mm-processor-cache-gb 0 \
    --enforce-eager