Do Value Vectors in Deep Layers Need Context from the Residual Stream?

The success of the transformer architecture as the backbone of modern LLMs is due in large part to its attention layers. An attention layer follows the standard neural network paradigm: it takes the residual stream as input and from it produces context-dependent query, key, and value vectors.

However, we find that model performance meaningfully improves when deeper layers learn only a context-free value vector that preserves the original token information, without drawing on any context from the residual stream. Once the model has access to this context-free value vector, adding back the context-dependent component provides little additional benefit for aggregate benchmark performance.
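The contrast between the two kinds of value vector can be written out as follows. The notation is assumed for illustration and does not come from the source: $h_i^{(\ell)}$ is the residual stream at position $i$ entering layer $\ell$, $x_i$ is the token id at position $i$, $W_V^{(\ell)}$ is the usual value projection, and $T^{(\ell)}$ is a learned per-layer table with one row per vocabulary entry.

```latex
\begin{align}
  \text{context-dependent (standard):} \quad
    v_i^{(\ell)} &= W_V^{(\ell)}\, h_i^{(\ell)} \\
  \text{context-free (deep layers):} \quad
    v_i^{(\ell)} &= T^{(\ell)}[x_i]
\end{align}
```

In the second form the value depends only on the token identity, so two occurrences of the same token receive the same value vector in a given layer regardless of their surrounding context. Queries and keys are still computed from the residual stream in both cases.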

Such context-free value vectors can be stored as sparse model parameters, eliminating the need to recompute or persistently cache these values. Through systematic ablations of the key design choices for such vectors, we propose Bank of Values (BoV), a new way of computing value vectors in attention: each layer in the last third of the network learns its own lookup table of token-specific value vectors.
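As a rough illustration of the idea, the sketch below implements a single-head causal attention layer in plain Python where the values are either projected from the hidden states or looked up by token id. It is not the authors' implementation: the function names (`attention_layer`, `uses_value_bank`), the single-head layout without an output projection, and the reading of "last third of layers" as the final `num_layers // 3` layers are all assumptions made for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [dot(row, x) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    # Position i attends to positions 0..i with scaled dot-product scores.
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        out = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


def uses_value_bank(layer, num_layers):
    # Assumption: "last third" means the final num_layers // 3 layers
    # (0-indexed). The source does not spell out the exact boundary.
    return layer >= num_layers - num_layers // 3


def attention_layer(token_ids, hidden, W_q, W_k, W_v, value_bank=None):
    # Queries and keys always come from the residual stream (hidden states).
    queries = [matvec(W_q, h) for h in hidden]
    keys = [matvec(W_k, h) for h in hidden]
    if value_bank is not None:
        # Context-free values: one learned row per token id, no projection.
        values = [value_bank[t] for t in token_ids]
    else:
        # Standard context-dependent values.
        values = [matvec(W_v, h) for h in hidden]
    return causal_attention(queries, keys, values)
```

With a value bank, the value path performs a table lookup instead of a matrix multiplication, and the looked-up rows are model parameters rather than activations, which is consistent with the observation above that they need not be recomputed or kept in a persistent cache.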

Across 135M and 780M models, BoV improves validation loss over standard attention. At 780M, it also improves the average score across 21 benchmarks, matching the previous best method that adds token information to the value vector while using less compute and memory.