ART: Attention Run-time Termination for Efficient Large Language Model Decoding
Abstract: Long-context decoding in Large Language Models (LLMs) is severely constrained by the memory bandwidth required to fetch the large Key-Value (KV) cache. Most existing KV cache management methods prune on keys alone before decoding, even though attention outputs depend jointly on keys and values, because incorporating values into these methods incurs prohibitive additional overhead.
In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks accumulated attention outputs during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. This design makes ART orthogonal to existing key-based KV cache management methods, enabling seamless integration with them.
Experiments on LongBench show that ART achieves 20% higher generation throughput at large batch sizes than the state-of-the-art baseline while maintaining comparable accuracy.
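To make the run-time termination idea concrete, the following is a minimal sketch and not the paper's kernel. It computes single-query attention block by block with a running (online) softmax, and stops reading further KV blocks once the latest block changes the accumulated output by less than a relative tolerance. The function name `attention_with_early_stop`, the parameters `block_size` and `tol`, and the stopping rule itself are assumptions made for illustration; the abstract does not specify ART's actual criterion.

```python
import math


def attention_with_early_stop(q, keys, values, block_size=4, tol=1e-3):
    """Single-query attention over KV blocks with a hypothetical stop rule.

    Assumes `keys` and `values` are non-empty lists of equal length.
    Returns (output_vector, number_of_blocks_read).
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf        # largest logit seen so far
    denom = 0.0                    # running softmax denominator
    acc = [0.0] * len(values[0])   # running unnormalized output
    prev_out = None
    blocks_read = 0

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        logits = [scale * sum(a * b for a, b in zip(k, q)) for k in k_blk]

        # Rescale earlier partial sums to the new maximum (online softmax).
        new_max = max(running_max, max(logits))
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= correction
        acc = [a * correction for a in acc]

        for logit, v in zip(logits, v_blk):
            w = math.exp(logit - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]

        running_max = new_max
        blocks_read += 1
        out = [a / denom for a in acc]

        # Assumed criterion: stop when this block barely moved the output.
        if prev_out is not None:
            change = math.sqrt(sum((o - p) ** 2 for o, p in zip(out, prev_out)))
            size = math.sqrt(sum(o * o for o in out))
            if change <= tol * size:
                break
        prev_out = out

    return out, blocks_read
```

Because the check compares accumulated outputs rather than attention scores, it reflects both keys and values, which is the property the abstract emphasizes. In this sketch the benefit depends on the block order: if a key-based method places the most relevant blocks first, termination can trigger earlier. A negative `tol` disables early stopping for non-zero outputs and yields ordinary softmax attention.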