Title: A Case for the KV Cache Layer - Enabling the Next Phase of Fast Distributed LLM Serving
Speaker: Yuhan Liu (University of Chicago) [webpage: https://yuhanliu11.github.io/]
Time: 10:00 am, October 24 (Friday), 2025
Location: Ryder Hall 156
Online link: provided upon request or see the seminar email.

Abstract:

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging, as nothing can be generated until the whole context has been processed by the LLM. The context-processing delay can be reduced by reusing the KV cache of a context across different inputs, but the KV cache consists of large tensors, so fetching it over the network can itself add substantial delay.
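
For illustration only, reusing the KV cache of a shared context across queries might look like the following minimal sketch with Hugging Face Transformers (the model name and prompts are placeholders, not part of the talk):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    # Prefill the long shared context once and keep its KV cache.
    context_ids = tok("<long domain-knowledge context>", return_tensors="pt").input_ids
    with torch.no_grad():
        prefill = model(context_ids, use_cache=True)
    kv_cache = prefill.past_key_values   # large tensors: one K and one V per layer

    # A later query over the same context skips re-processing the context
    # by feeding only the new tokens together with the cached KV.
    query_ids = tok(" A question about the context?", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(query_ids, past_key_values=kv_cache, use_cache=True)
    next_token = out.logits[:, -1].argmax(dim=-1)

Even with such reuse, kv_cache itself can be hundreds of megabytes for long contexts, which is exactly the network-transfer cost the talk addresses.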

CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder that leverages the KV cache's distributional properties to encode it into compact bitstream representations, saving bandwidth with negligible decoding overhead. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, maintaining low context-loading delay and high generation quality. When available bandwidth drops, CacheGen may raise the compression level for a part of the context or recompute its KV cache on the fly. We evaluate CacheGen on popular LLMs and datasets. Compared with recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3x and the total delay of fetching and processing contexts by 3.2-3.7x, with negligible impact on LLM response quality. Our code is available in our open-source project, LMCache: https://github.com/LMCache/LMCache.
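
As a rough sketch of the second idea (this is not CacheGen's actual policy or the LMCache API; the chunk sizes, levels, bandwidth, and budget below are made up for illustration), the per-chunk fetch-or-recompute decision could look like:

    def plan_chunk(chunk_tokens, bytes_at_level, bandwidth_bytes_per_s,
                   prefill_tokens_per_s, delay_budget_s):
        """Pick the lowest (highest-quality) compression level whose transfer
        fits the per-chunk delay budget; otherwise decide between fetching the
        most compressed bitstream and recomputing the chunk's KV cache from text."""
        for level in sorted(bytes_at_level):          # level 0 = least compressed
            if bytes_at_level[level] / bandwidth_bytes_per_s <= delay_budget_s:
                return f"fetch at compression level {level}"
        # No level fits the budget: compare recomputation against the fastest fetch.
        recompute_s = chunk_tokens / prefill_tokens_per_s
        fastest_fetch_s = min(bytes_at_level.values()) / bandwidth_bytes_per_s
        if recompute_s < fastest_fetch_s:
            return "recompute KV cache from text"
        return f"fetch at compression level {max(bytes_at_level)}"

    # Illustrative numbers only: a 2,048-token chunk, three encoding levels,
    # ~12.5 MB/s of available bandwidth, and a 1-second per-chunk budget.
    print(plan_chunk(chunk_tokens=2048,
                     bytes_at_level={0: 40_000_000, 1: 20_000_000, 2: 10_000_000},
                     bandwidth_bytes_per_s=12_500_000,
                     prefill_tokens_per_s=8000,
                     delay_budget_s=1.0))
    # -> fetch at compression level 2

The point of the sketch is the design choice the abstract describes: under tight bandwidth, trading a little generation quality (higher compression) or some GPU compute (recomputation) keeps context-loading delay low.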

Bio:

Yuhan Liu is a fifth-year Ph.D. student at the University of Chicago, co-advised by Junchen Jiang and Shan Lu. Her research focuses on building an efficient KV cache layer for serving LLMs, including KV cache compression, dynamic blending of KV caches, and cross-model KV cache sharing. She also leads LMCache, the open-source KV cache layer of vLLM.