
Notes on vAttention

Published: at 07:47 AM

Dynamic Memory Management for Serving LLMs without PagedAttention


1. Introduction

The challenges of efficiently allocating GPU memory for the KV cache:

  1. the per-request KV cache grows slowly, one token per iteration
  2. a request’s decode length, and hence its total KV cache size, is not known in advance.

PagedAttention faces the problem that dynamically allocated objects (KV cache blocks) are not guaranteed to be contiguous in virtual memory. This approach has the following pitfalls:

  1. requires rewriting the attention kernels
  2. forces the developer to implement a memory manager in the serving framework
  3. adds runtime overhead on both the GPU and the CPU

The fundamental issue is the reservation-based memory allocation interface exposed by the CUDA runtime: it allocates virtual and physical memory together, even when the corresponding virtual memory is never accessed. Separating the allocation of virtual memory from physical memory therefore enables more efficient memory management.

Introduce vAttention, built on the CUDA virtual memory management (VMM) APIs.

Challenges & Solutions:

| Challenge | Solution |
| --- | --- |
| Memory allocation using the VMM APIs incurs high latency, because each allocation involves a round-trip to the OS kernel. | Overlap allocation with compute, opportunistically allocate pages ahead of time, and defer memory reclamation (sketched below). |
| The VMM APIs only support allocation at the granularity of large pages, i.e. in multiples of 2MB. | Modify the open-source CUDA unified virtual memory driver to support finer-grained allocation. |
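
A minimal sketch of what the first mitigation could look like in a serving loop; `map_one_page_group`, `run_decode_iteration`, and `serve_one_iteration` are hypothetical names, not vAttention's actual API:

```cpp
#include <thread>
#include <vector>

// Hypothetical stubs, not vAttention's API: in the real system these would
// call the CUDA VMM APIs (see 5.2.1) and launch the model forward pass.
void map_one_page_group(int /*reqId*/) { /* cuMemCreate + cuMemMap ... */ }
void run_decode_iteration(const std::vector<int>& /*batch*/) { /* GPU compute */ }

void serve_one_iteration(const std::vector<int>& batch) {
    // Overlap allocation with compute: while the GPU runs the current
    // iteration, a background thread opportunistically maps page-groups
    // that requests will need in upcoming iterations. Pages freed when a
    // request finishes would be kept for reuse (deferred reclamation)
    // rather than being returned to the driver right away.
    std::thread allocator([&] {
        for (int reqId : batch) map_one_page_group(reqId);
    });
    run_decode_iteration(batch);
    allocator.join();
}

int main() { serve_one_iteration({0, 1, 2, 3}); }
```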

2. Background

A single call to a CUDA VMM API can allocate one or more physical pages; the set of pages allocated in one call is called a page-group.

3. Issues with the PagedAttention Approach

PagedAttention implements demand paging in user space, which, unlike OS demand paging, is not transparent to applications.

3.1 Requires Re-writing the Attention Kernel

3.2 Adds Redundancy in the Serving Framework

The OS already performs virtual-to-physical address translation; PagedAttention adds a second, user-space translation layer (the Block-Table), so every KV cache access goes through two translations, which is redundant.

3.3 Performance Overhead

3.3.1 Runtime Overhead on the GPU

Paged kernels introduce the overhead of looking up Block-Tables and executing extra branches: the number of instructions executed is higher, and caching page indices increases register pressure.
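
An illustrative comparison (not vLLM's actual kernel code) of the addressing arithmetic: a virtually contiguous cache needs one multiply per token, while a paged cache needs a Block-Table load plus a div/mod, which is where the extra instructions and registers come from:

```cuda
#include <cuda_fp16.h>

// Illustrative device helpers only: addressing a single K vector with a
// contiguous cache vs. a paged cache.
__device__ const half* k_vec_contiguous(const half* k_cache, int token,
                                         int head_dim) {
    // Virtually contiguous KV cache: one multiply locates the token's slot.
    return k_cache + (size_t)token * head_dim;
}

__device__ const half* k_vec_paged(const half* k_blocks, const int* block_table,
                                   int token, int block_size, int head_dim) {
    // Paged KV cache: an extra Block-Table load plus div/mod per token, and
    // holding the block indices increases register pressure.
    int block  = block_table[token / block_size];   // physical block id
    int in_blk = token % block_size;
    return k_blocks + ((size_t)block * block_size + in_blk) * head_dim;
}
```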

The performance of the decode kernel is also sensitive to the block size, due to L1 cache efficiency.

3.3.2 CPU overhead

The Block-Table depends on batch composition and grows in proportion to max_num_blocks * batch_size: vLLM manages it as a 2D tensor, so it must align the number of KV cache blocks across requests by padding rows with zero slots (toy example below).
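
A toy illustration (assumed block size and request lengths, not from the paper) of how row padding makes the 2D Block-Table grow with batch composition:

```cpp
#include <cstdio>
#include <vector>
#include <algorithm>

int main() {
    const int block_size = 16;                            // tokens per KV block (toy value)
    std::vector<int> context_len = {37, 512, 90, 4000};   // per-request lengths (toy values)
    const int batch_size = (int)context_len.size();

    // Rows are padded to the longest request in the batch, so the table's
    // size depends on batch composition: max_num_blocks * batch_size slots.
    int max_num_blocks = 0;
    for (int len : context_len)
        max_num_blocks = std::max(max_num_blocks, (len + block_size - 1) / block_size);

    std::vector<std::vector<int>> block_table(
        batch_size, std::vector<int>(max_num_blocks, 0));  // zero-padded slots

    int real_entries = 0, next_block = 1;                  // toy physical block ids
    for (int r = 0; r < batch_size; ++r) {
        int blocks = (context_len[r] + block_size - 1) / block_size;
        for (int b = 0; b < blocks; ++b) block_table[r][b] = next_block++;
        real_entries += blocks;
    }
    printf("Block-Table slots: %d, real entries: %d\n",
           batch_size * max_num_blocks, real_entries);
}
```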

4. Insights

Observation 1

The KV cache memory requirement is predictable on a per-iteration basis: each decode iteration adds exactly one token per request.

Observation 2

The KV cache does not require high memory allocation bandwidth.
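
A back-of-the-envelope check of both observations, using assumed model parameters that are not from the paper: each decode iteration adds one token per request, so the physical memory demanded per iteration is both known in advance and small:

```cpp
#include <cstdio>

int main() {
    // Assumed example parameters (FP16, Llama-style model), not from the paper.
    const long long B = 64;     // batch size
    const long long N = 32;     // layers
    const long long H = 32;     // KV heads
    const long long D = 128;    // head dimension
    const long long P = 2;      // bytes per element (FP16)

    // One decode iteration grows each request's KV cache by exactly one token:
    // per-token footprint = 2 (K and V) * N * H * D * P bytes.
    long long per_token = 2 * N * H * D * P;
    long long per_iter  = B * per_token;
    printf("per token: %lld KB, per iteration (batch %lld): %lld MB\n",
           per_token / 1024, B, per_iter / (1024 * 1024));
}
```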

5. vAttention: Design and Implementation

Primary observation:

Physical memory fragmentation can be avoided without making the KV cache non-contiguous in virtual memory.

5.1 Design Overview

Allocate a large contiguous buffer for the KV cache in virtual memory ahead of time, while deferring the allocation of physical memory to runtime, one page-group at a time and only when needed.

5.1.1 Pre-reserving virtual memory.

Assume that each request’s context length is the same as the maximum supported by the model, and reserve virtual memory accordingly.

5.1.2 Number of virtual memory buffers.

2 × N buffers, where N is the number of layers (one K-cache buffer and one V-cache buffer per layer).

5.1.3 Size of a virtual memory buffer.

BS = B × S, where B is the maximum batch size and S is the maximum size of a single request’s per-layer K cache.

S = L × H × D × P, where L is the maximum context length, H is the number of KV heads, D is the head dimension, and P is the number of bytes per element, based on the model precision.
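
A worked example with assumed model parameters (not from the paper), showing how much virtual address space the 2 × N buffers reserve; only the portion actually in use is ever backed by physical pages:

```cpp
#include <cstdio>

int main() {
    // Assumed example parameters (FP16, Llama-style model), not from the paper.
    const long long B = 64;      // maximum batch size
    const long long L = 4096;    // maximum context length
    const long long H = 32;      // KV heads
    const long long D = 128;     // head dimension
    const long long P = 2;       // bytes per element (FP16)
    const long long N = 32;      // layers

    long long S     = L * H * D * P;   // max per-request, per-layer K cache
    long long BS    = B * S;           // one virtual buffer (K or V, one layer)
    long long total = 2 * N * BS;      // 2 buffers (K and V) per layer

    printf("S = %lld MB per request per layer\n", S / (1024 * 1024));
    printf("BS = %lld GB per buffer, total virtual reservation = %lld GB\n",
           BS / (1024 * 1024 * 1024), total / (1024 * 1024 * 1024));
}
```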

5.2 Leveraging CUDA Virtual Memory Support

The standard GPU memory allocation interface cudaMalloc doesn’t support demand paging.

5.2.1 CUDA virtual memory APIs.

These APIs allow decoupling the allocation of virtual memory from physical memory: physical memory pages can be mapped to sub-regions of a virtual memory buffer independently of other sub-regions.
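
A minimal sketch of the driver calls involved (error handling omitted, buffer size a placeholder): reserve a large virtual range ahead of time with cuMemAddressReserve, then at runtime create a physical page-group with cuMemCreate and map it into a sub-region with cuMemMap and cuMemSetAccess:

```cpp
#include <cuda.h>

// Minimal sketch, not vAttention's allocator; error checks omitted for brevity.
int main() {
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    CUmemAllocationProp prop = {};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = dev;

    size_t granularity = 0;   // page-group size (2MB multiples on current drivers)
    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);

    // 1. Reserve a large virtual buffer ahead of time (no physical memory yet).
    size_t buf_size = 256 * granularity;             // placeholder size
    CUdeviceptr buf;
    cuMemAddressReserve(&buf, buf_size, 0, 0, 0);

    // 2. At runtime, back one sub-region with a physical page-group on demand.
    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, granularity, &prop, 0);     // allocate physical pages
    size_t offset = 0;                               // sub-region chosen by the allocator
    cuMemMap(buf + offset, granularity, 0, handle, 0);

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(buf + offset, granularity, &access, 1);

    // ... use buf as a virtually contiguous KV cache buffer ...

    // Teardown: unmap, release the physical handle, free the virtual range.
    cuMemUnmap(buf + offset, granularity);
    cuMemRelease(handle);
    cuMemAddressFree(buf, buf_size);
    cuCtxDestroy(ctx);
}
```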

5.2.2 Extending the PyTorch caching allocator.

vAttention extends the PyTorch caching allocator to expose virtual tensors: tensors whose virtual memory is reserved up front, with physical pages attached lazily at runtime.

5.2.3 Request-level kv cache indexing

A virtual tensor represents the K (or V) cache of one layer for the maximum batch size B. Different requests occupy different, non-overlapping sub-regions of it, and the sub-tensor of a request is located via a unique integer index reqId.
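
A small sketch of the implied offset arithmetic, reusing the notation from 5.1.3 (names are illustrative, not vAttention's API):

```cpp
#include <cstddef>
#include <cstdint>

// k_buffer[layer] points to the virtual K-cache buffer of one layer (size BS = B * S).
// Requests occupy disjoint, fixed-size sub-regions, so locating a request's
// K cache is a single offset computation: base + reqId * S.
inline void* request_k_cache(void* const* k_buffer, int layer,
                             int reqId, size_t S) {
    return static_cast<uint8_t*>(k_buffer[layer]) + static_cast<size_t>(reqId) * S;
}
```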

5.3 Serving LLMs with vAttention

