Kara: Efficient Reasoning LLM Serving via Sliding-Window KV Cache Compression

Abstract: Reasoning language models often generate long chain-of-thought (CoT) sequences, which accumulate a massive KV cache during the decoding phase, resulting in high decoding latency and limited throughput.

To address these issues, KV cache compression has emerged as a promising technique that reduces memory overhead by selectively removing unimportant KV pairs while preserving those useful for subsequent decoding.

Nevertheless, we identify two key limitations in existing KV cache compression methods: 1) their threshold-triggered compression policy may yield limited throughput improvement or even reduce throughput, and may eliminate all KV pairs from certain blocks of the sequence, potentially worsening information loss; 2) they typically retain either isolated KV pairs or fixed-size chunks with rigid boundaries, failing to preserve important, flexibly sized chunks at arbitrary token positions.

To overcome these limitations, we propose Kara, a sliding-window KV cache compression method that performs decoding-time compression by operating only on the recently generated context.
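As an illustration of the sliding-window idea only, the following toy sketch tracks how many KV entries a cache would hold if each newly generated window were compressed as soon as it fills. The fixed window size `window`, the per-window budget `keep`, and the function name `cache_sizes` are assumptions made for this example; the abstract does not specify them.

```python
def cache_sizes(num_steps, window, keep):
    """Return the KV cache size after each decoding step.

    Assumption for illustration: once `window` new tokens have been
    generated, only that window is compressed down to `keep` entries.
    Entries retained from earlier windows are never revisited.
    """
    compressed = 0  # entries retained from already-compressed windows
    pending = 0     # uncompressed entries in the current window
    sizes = []
    for _ in range(num_steps):
        pending += 1  # one new KV entry per decoded token
        if pending == window:
            compressed += keep  # compress only the recent window
            pending = 0
        sizes.append(compressed + pending)
    return sizes


print(cache_sizes(8, window=4, keep=2))  # [1, 2, 3, 2, 3, 4, 5, 4]
```

Under these assumptions the cache grows by `keep` entries per window rather than `window` entries, and each compression step touches only the most recent tokens.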

Kara leverages bidirectional attention to score and select informative KV pairs within the window. To enable flexible preservation of important semantic information, we design a Token2Chunk module that expands a subset of the selected KV pairs into chunks.
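The two steps described here can be sketched roughly as follows. This is a simplified, single-head illustration and not the paper's implementation: the scoring rule (average attention mass a key receives from all queries in the window, with no causal mask), the symmetric expansion `radius`, and the names `score_window` and `select_with_chunks` are all assumptions introduced for this example.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def score_window(queries, keys):
    """Score each key by the attention it receives inside the window.

    Assumed scoring rule: every query attends to every key in the
    window (no causal mask), and a key's score is its attention
    weight averaged over all queries.
    """
    d = len(keys[0])
    scores = [0.0] * len(keys)
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        for j, w in enumerate(softmax(logits)):
            scores[j] += w / len(queries)
    return scores


def select_with_chunks(scores, num_keep, num_expand, radius):
    """Keep the top `num_keep` positions, then expand the top
    `num_expand` of them into chunks of up to `radius` neighbors per side.

    Hypothetical stand-in for a token-to-chunk expansion; note that the
    expansion can retain more than `num_keep` positions in total.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:num_keep])
    for i in ranked[:num_expand]:
        lo = max(0, i - radius)
        hi = min(len(scores) - 1, i + radius)
        kept.update(range(lo, hi + 1))
    return sorted(kept)


print(select_with_chunks([0.1, 0.5, 0.05, 0.05, 0.2, 0.1],
                         num_keep=2, num_expand=1, radius=1))  # [0, 1, 2, 4]
```

In this toy example, position 1 is expanded into the chunk 0 to 2 while position 4 is kept as an isolated entry, so the retained chunks can start at any position and differ in size.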

Furthermore, we adapt Kara to PagedAttention and develop KvLLM, an inference framework built upon vLLM, which reduces KV cache memory usage and effectively improves output throughput. Extensive experiments demonstrate consistent performance improvements of the proposed Kara and KvLLM.