One "+x" That Made 100-Layer Networks Trainable: ResNet Skip Connections
Deep networks have a cruel paradox. In theory, more layers should never hurt: the extra ones could just learn to pass their input through unchanged. In practice, before 2015, stacking more plain layers made networks worse: a 56-layer plain net had higher training error than a 20-layer one, so the problem was optimisation, not overfitting. The gradient vanished on its way back to the early layers, and optimisation couldn’t even find that “do nothing” identity mapping.
ResNet fixed it with almost absurdly little.
The residual reformulation: instead of asking a block to learn a full mapping H(x), ask it to learn the residual F(x) = H(x) − x, and add the input back:
    def forward(self, x):
        # self.f computes the residual F(x); the "x +" term is the skip connection
        return F.relu(x + self.f(x))  # y = relu(x + F(x))
If the ideal mapping is close to identity, F(x) just needs to be near zero — trivial to learn (push the weights toward 0). The block only learns the correction on top of passing the input through.
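A minimal PyTorch sketch of that point, under stated assumptions: the class name `ResidualBlock`, the two-layer MLP used for F, the layer width, and the zero initialisation of the last layer are all illustrative choices, not something taken from the paper. With the last layer zeroed, F(x) is exactly 0, so the block reduces to relu(x), which equals x for non-negative inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """y = relu(x + F(x)), where F is a small two-layer MLP (illustrative)."""

    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        # Assumption for this demo: zero the last layer so F(x) starts at 0.
        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def f(self, x: torch.Tensor) -> torch.Tensor:
        # The residual branch F(x): the correction on top of the identity.
        return self.fc2(F.relu(self.fc1(x)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(x + self.f(x))  # the "x +" is the skip connection


block = ResidualBlock(dim=8)
x = torch.rand(4, 8)  # values in [0, 1), so the outer ReLU leaves them unchanged
y = block(x)
print(torch.allclose(y, x))  # True: with F(x) = 0 the block passes x through
```

A plain (non-residual) block with the same zeroed weights would output all zeros instead, which is why "do nothing" is easy for one design and hard for the other.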
Why the +1 saves the gradient: Differentiate the block: d(x + F(x))/dx = 1 + F’(x). Backprop multiplies these across blocks. Even when F’(x) is tiny, the factor stays near 1 instead of near 0, so the product doesn’t collapse:
    plain:    dL/dx1 = product of F’(z)        -> 0      (each F’ <= ~0.25 for sigmoid)
    residual: dL/dx1 = product of (1 + F’(z))  -> ~O(1)
                                   ^ the identity path never vanishes
The identity path is a gradient highway straight back to the earliest layers.
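The two products can be compared numerically with a scalar toy model. This is only a sketch: the depth of 50 and the per-block derivative of 0.01 are made-up numbers, and `chain_product` is a hypothetical helper, not part of any library.

```python
import math


def chain_product(local_grads, residual):
    """Multiply per-block derivatives the way backprop chains them (scalar toy model)."""
    factors = [(1.0 + g) if residual else g for g in local_grads]
    return math.prod(factors)


depth = 50
local_grads = [0.01] * depth  # assumed tiny F'(x) in every block

plain = chain_product(local_grads, residual=False)
resid = chain_product(local_grads, residual=True)

print(f"plain:    {plain:.3e}")  # about 1e-100: the gradient has vanished
print(f"residual: {resid:.3f}")  # about 1.645: still order 1
```

The residual product stays near 1 only while each F’ is small; the point of the toy model is that a small F’ no longer drives the product to zero.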
Projection shortcuts: When a block changes the feature dimensions (say, a strided conv that halves spatial size and doubles channels), x and F(x) no longer have the same shape, so you can’t add them. Put a 1×1 conv on the skip, with the same stride as the block, to project x into the new shape first. This is the “projection shortcut” from the paper. Most shortcuts are plain identity; only dimension-changing ones need this.
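A rough PyTorch sketch of a dimension-changing block with a projection shortcut. The class name `DownsampleBlock`, the 64 -> 128 channel sizes, and the exact layer layout (two 3×3 convs with batch norm) are assumptions chosen for illustration; real implementations differ in details such as normalising the shortcut as well.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DownsampleBlock(nn.Module):
    """Residual block that halves H and W and changes the channel count."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Residual branch F(x): the first conv uses stride 2 to halve spatial size.
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Projection shortcut: a 1x1 conv with the same stride, so shapes match.
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # x itself is (N, in_ch, H, W); the projected x matches out's shape.
        return F.relu(self.proj(x) + out)


block = DownsampleBlock(64, 128)
x = torch.randn(2, 64, 32, 32)
y = block(x)
print(tuple(y.shape))  # (2, 128, 16, 16)
```

Replacing `self.proj(x)` with plain `x` here would raise a shape-mismatch error at the addition, which is exactly the case the projection exists to handle.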
The impact: With residual blocks, the 2015 ResNet paper trained 152-layer networks, about 8× deeper than the VGG nets that preceded them, and won the ILSVRC 2015 ImageNet classification task. Deeper finally meant better again. And skip connections are now everywhere: ResNets, U-Nets, and every Transformer block (x + Sublayer(x)). The same +1 quietly keeps gradients healthy inside modern LLMs.