ART: Attention Run-time Termination for Efficient Large Language Model Decoding

Abstract: Long-context decoding in Large Language Models (LLMs) is severely constrained by the memory bandwidth required to fetch the extensive Key-Value (KV) cache. Most existing KV management methods prune on keys alone before decoding, even though attention outputs depend jointly on keys and values, because incorporating values into these methods incurs prohibitive additional overhead.

In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks accumulated attention outputs during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. This design makes ART orthogonal to existing key-based KV cache management methods, enabling seamless integration with them.
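
The idea of stopping KV block reads once the accumulated output has stabilized can be illustrated with a minimal single-query sketch in plain Python. This is not the ART kernel: the function name `blockwise_attention`, the `tol` parameter, and the stopping rule (relative L2 change of the normalized output caused by the most recent block) are all assumptions made for illustration, and the abstract does not specify the actual termination criterion. The sketch also assumes blocks are visited in an order where later blocks tend to matter less.

```python
import math

def blockwise_attention(q, keys, values, block_size=2, tol=None):
    """Single-query attention over KV blocks using an online softmax.

    Hypothetical stopping rule (an assumption, not ART's criterion):
    if tol is set, stop reading blocks once the latest block changed the
    normalized output by less than tol in relative L2 norm.
    Assumes keys/values are non-empty. Returns (output, blocks_read).
    """
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")               # running maximum score
    denom = 0.0                     # running softmax denominator
    acc = [0.0] * len(values[0])    # running unnormalized output
    prev_out = None
    blocks_read = 0
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_m = max(m, max(scores))
        rescale = math.exp(m - new_m)   # 0.0 on the first block
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_m)
            denom += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        m = new_m
        blocks_read += 1
        out = [a / denom for a in acc]
        if tol is not None and prev_out is not None:
            delta = math.sqrt(sum((o - p) ** 2 for o, p in zip(out, prev_out)))
            norm = math.sqrt(sum(o * o for o in out)) or 1.0
            if delta / norm < tol:
                break                   # skip all remaining KV blocks
        prev_out = out
    return out, blocks_read

# Toy data: the first block dominates, the remaining three barely contribute.
q = [1.0, 0.0]
keys = [[10.0, 0.0]] * 2 + [[-10.0, 0.0]] * 6
values = [[1.0, 0.0]] * 2 + [[0.0, 1.0]] * 6
full, n_full = blockwise_attention(q, keys, values)
early, n_early = blockwise_attention(q, keys, values, tol=1e-3)
```

On this toy input the early-terminating call reads fewer blocks than the full pass while producing nearly the same output. Note that this simple rule must still read one low-contribution block before it can decide to stop, and it can stop too early if a large contribution arrives late, so it should be read only as an illustration of the accumulate-then-terminate pattern.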

Experiments on the LongBench benchmark show that ART achieves 20% higher generation throughput at large batch sizes than the state-of-the-art baseline while maintaining comparable accuracy.
