Depth-Staggered Fibonacci Spacing for Sparse Attention: Static Schedules Beat Learned Dilation and Extrapolate Where Dense Attention Fails

Abstract: We study sparse self-attention in which each query attends to a dense local window plus a set of Fibonacci-spaced offsets, with a per-layer scalar alpha that compresses or expands the spacing. Across 21 language models trained under one matched recipe (60M parameters, hidden size 512, 16 layers, 426M tokens), we compare four ways of setting alpha across depth: fixed, learned per layer, a static linear stagger, and a coprime (anti-gridding) reassignment of that stagger. We also include a reach-matched power-of-2 control.
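
As a rough illustration of the attention pattern described above, the following sketch builds the set of backward offsets a single query would attend to. It assumes causal attention, that alpha scales each Fibonacci offset multiplicatively, and that scaled offsets are rounded to integers; the names `fibonacci_offsets`, `attended_offsets`, `window`, and `max_reach` are hypothetical and not taken from the paper.

```python
def fibonacci_offsets(max_reach):
    """Distinct Fibonacci numbers 1, 2, 3, 5, 8, ... not exceeding max_reach."""
    offsets, a, b = [], 1, 2
    while a <= max_reach:
        offsets.append(a)
        a, b = b, a + b
    return offsets


def attended_offsets(window, alpha, max_reach):
    """Backward offsets for one query (assumed form, not the paper's code).

    Combines a dense local window (offsets 0 .. window-1) with Fibonacci
    offsets scaled by alpha, rounded to integers and deduplicated.
    """
    local = set(range(window))
    # Assumption: alpha multiplies each offset; values are clamped to >= 1.
    scaled = {max(1, round(alpha * f)) for f in fibonacci_offsets(max_reach)}
    scaled = {o for o in scaled if o <= max_reach}
    return sorted(local | scaled)


print(attended_offsets(window=4, alpha=1.0, max_reach=32))
# [0, 1, 2, 3, 5, 8, 13, 21]
```

Under these assumptions, alpha above 1 spreads the sparse offsets further apart and alpha below 1 pulls them toward the local window.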

Three results stand out. First, a static per-layer stagger improves perplexity over both fixed and learned alpha, and the gain is base-agnostic: applying the same stagger to a power-of-2 base lifts it above fixed Fibonacci attention and to parity with learned Fibonacci attention. Second, learning alpha per layer is inert: it does not beat the static schedule, and it costs roughly five times the inference latency.
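
One way to picture a static linear stagger is as a fixed, evenly spaced alpha per layer, in contrast to a single shared value. The sketch below is only an assumed form: the function name `linear_stagger` and the endpoint values 0.5 and 2.0 are hypothetical, and the abstract does not state the actual range or direction of the schedule.

```python
def linear_stagger(num_layers, alpha_min, alpha_max):
    """Evenly spaced alpha values, one per layer (assumed schedule shape)."""
    if num_layers == 1:
        return [alpha_min]
    step = (alpha_max - alpha_min) / (num_layers - 1)
    return [alpha_min + i * step for i in range(num_layers)]


# Hypothetical endpoints for a 16-layer model; the paper's values may differ.
staggered = linear_stagger(num_layers=16, alpha_min=0.5, alpha_max=2.0)
fixed = [1.0] * 16  # the "fixed alpha" condition: one value shared by all layers
```

Because the schedule is computed once and never updated, it adds no trainable parameters.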

Third, and most consequential, all sparse variants extrapolate to four times their training length with little or no degradation, whereas a recipe-matched dense baseline collapses (its perplexity rises by 201% at 4x length). We attribute this to fixed-offset attention only ever querying relative positions seen during training. We also report two negative results: at training length, the best sparse model has about 26% higher perplexity than the dense baseline, and the gain from staggering is uniform across context positions rather than concentrated at long range.
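
The attribution above can be illustrated by counting which relative distances appear at evaluation length but never at training length. The sketch assumes causal attention and a sparse offset set whose largest offset is below the training length; the function name `unseen_distances` and the lengths used are hypothetical. It illustrates the argument only and does not reproduce the reported perplexities.

```python
def unseen_distances(eval_len, train_len, offsets=None):
    """Relative distances used at eval_len that never occurred at train_len.

    offsets=None models dense causal attention, where every distance
    0 .. n-1 occurs in a sequence of length n. Otherwise only the fixed
    offsets that fit inside the sequence are ever queried.
    """
    def used(n):
        if offsets is None:
            return set(range(n))
        return {o for o in offsets if o < n}

    return sorted(used(eval_len) - used(train_len))


train_len, eval_len = 512, 2048           # hypothetical lengths, 4x extrapolation
sparse = [0, 1, 2, 3, 5, 8, 13, 21, 34]   # hypothetical fixed offset set

print(len(unseen_distances(eval_len, train_len)))          # dense: 1536 unseen
print(len(unseen_distances(eval_len, train_len, sparse)))  # sparse: 0 unseen
```

In this toy count, dense attention at 4x length must handle 1536 relative distances it never saw in training, while the fixed-offset pattern encounters none.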
