The Weight Norm Sets the Grokking Timescale: A Causal Delay Law
Abstract: Grokking is the delayed onset of generalization in neural networks, arising long after they fit the training data. Whether the weight norm causes this delay is disputed: some studies report a critical norm at the transition, others observe grokking with no fixed norm at all.
We settle this by intervening on the norm during training rather than only observing it. Under free training with weight decay, networks grok when the weight norm reaches a value $W_c$ that varies little across seeds and learning rates (coefficient of variation of 1 to 2 percent) and grows with the modular base as a power law.
When we instead clamp the norm to a fixed multiple $\rho$ of $W_c$ and hold it there, the network still groks, but the delay follows $T_{grok} \propto \exp(\alpha \rho)$. A single exponent, $\alpha \approx 7.5$, fits this delay across four moduli ($R^2 = 0.996$).
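As an illustration of the two ingredients in this paragraph, the following minimal sketch shows one way a norm clamp and a fit of the delay law could be written. It is not the authors' code. The helper names (`clamp_norm`, `fit_delay_exponent`), the value of `W_C`, the grid of `rhos`, and the prefactor `T0_TRUE` are all hypothetical, and the delays are synthetic values generated from the law itself, not measurements.

```python
import math


def clamp_norm(weights, target_norm):
    """Rescale a flat list of weights so its global L2 norm equals target_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        return list(weights)  # a zero vector cannot be rescaled
    scale = target_norm / norm
    return [w * scale for w in weights]


def fit_delay_exponent(rhos, delays):
    """Least-squares fit of log(T_grok) = log(T0) + alpha * rho.

    Returns (alpha, T0). Taking logs turns the exponential law into a line.
    """
    ys = [math.log(t) for t in delays]
    n = len(rhos)
    mean_x = sum(rhos) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in rhos)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rhos, ys))
    alpha = sxy / sxx
    t0 = math.exp(mean_y - alpha * mean_x)
    return alpha, t0


# Synthetic, noise-free delays generated from T = T0 * exp(alpha * rho).
# These are NOT measured data; they only check that the fit recovers alpha.
ALPHA_TRUE = 7.5            # value quoted in the abstract
T0_TRUE = 1.0               # hypothetical prefactor
rhos = [1.0, 1.1, 1.2, 1.3, 1.4]   # hypothetical sweep of the held-norm ratio
delays = [T0_TRUE * math.exp(ALPHA_TRUE * r) for r in rhos]
alpha_hat, t0_hat = fit_delay_exponent(rhos, delays)

# Clamp step: after each optimizer update, project the weights back to rho * W_c.
W_C = 10.0                  # hypothetical critical norm
rho = 1.2
clamped = clamp_norm([3.0, 4.0], rho * W_C)

# If the delay followed the law exactly with alpha = 7.5, a 19x change in delay
# would correspond to a change in rho of ln(19) / alpha.
rho_span_for_19x = math.log(19.0) / ALPHA_TRUE
```

In a real training loop the clamp would act on the model's parameter tensors after every optimizer step; the flat list here only keeps the sketch free of dependencies.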
Over the swept ranges, varying the held norm changes the delay by about 19x, whereas varying the learning rate changes it by only about 2x. Holding the norm above $W_c$ slows grokking rather than preventing it. A final LayerNorm removes the dependence by decoupling weight scale from the network function; without it the exponential law returns. This pinned-norm delay is the exponential counterpart to the logarithmic delay predicted for a freely contracting norm.
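The decoupling attributed to the final LayerNorm can be illustrated with a small sketch. It assumes a bias-free linear layer followed by a LayerNorm with no learnable affine parameters and `eps = 0`; the exact architecture and placement used in the paper are not stated here, so this setup, the matrix `W`, and the scale factor `c` are hypothetical. Under those assumptions, multiplying the weights by any positive constant leaves the normalized output unchanged.

```python
import math


def layer_norm(x, eps=0.0):
    """Normalize a vector to zero mean and unit variance (no affine parameters).

    Real implementations use a small positive eps, which makes the scale
    invariance approximate rather than exact.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, x):
    """Bias-free linear layer: one dot product per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Hypothetical weights and input, chosen only for illustration.
W = [[0.5, -1.0, 2.0],
     [1.5, 0.3, -0.7],
     [-0.2, 0.8, 1.1]]
x = [1.0, 2.0, -1.0]
c = 3.0  # positive rescaling of the weight norm
W_scaled = [[c * w for w in row] for row in W]

raw = linear(W, x)
raw_scaled = linear(W_scaled, x)

# Without normalization the outputs differ by the factor c.
raw_gap = max(abs(a - b) for a, b in zip(raw, raw_scaled))

# With a final LayerNorm the two outputs agree up to floating-point error.
ln_gap = max(abs(a - b) for a, b in zip(layer_norm(raw), layer_norm(raw_scaled)))
```

Here `raw_gap` is large while `ln_gap` is at floating-point noise level, which is the sense in which the weight scale no longer affects the function the network computes.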