One "+x" That Made 100-Layer Networks Trainable: ResNet Skip Connections

Deep networks have a cruel paradox. In theory, more layers should never hurt: the extra ones could just learn to pass their input through unchanged. In practice, before 2015, stacking more plain layers made networks worse: on CIFAR-10, a 56-layer plain net had higher training error than a 20-layer one, so the problem was not overfitting. The gradient signal faded on its way back to the early layers, and optimisation couldn’t even find that “do nothing” identity mapping.

ResNet fixed it with almost absurdly little.

The residual reformulation: Instead of asking a block to learn a full mapping H(x), ask it to learn the residual F(x) = H(x) − x, and add the input back:

    def forward(self, x):
        # self.f is the learned residual branch F(x); the "x +" is the skip connection.
        # F.relu is assumed to be torch.nn.functional.relu, not the residual F(x).
        return F.relu(x + self.f(x))  # y = relu(x + F(x))

If the ideal mapping is close to identity, F(x) just needs to be near zero — trivial to learn (push the weights toward 0). The block only learns the correction on top of passing the input through.
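
To make the "near zero is enough" point concrete, here is a minimal sketch in plain Python, with no framework so the arithmetic stays visible. The helper names `plain_block` and `residual_block` and the toy 3-element input are assumptions for illustration, not code from the paper. With every weight at zero, the plain block wipes out its input, while the residual block hands it through untouched.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def plain_block(weights, x):
    """Plain layer: the output is the learned mapping H(x) itself."""
    return relu(linear(weights, x))


def residual_block(weights, x):
    """Residual layer: the learned branch F(x) is added back onto the input."""
    fx = linear(weights, x)
    return relu([a + b for a, b in zip(x, fx)])


# Non-negative input, as it would be after an earlier ReLU.
x = [0.5, 2.0, 1.25]
zero_weights = [[0.0] * 3 for _ in range(3)]

print(plain_block(zero_weights, x))     # all zeros: the signal is lost
print(residual_block(zero_weights, x))  # the input, passed through unchanged
```

The input is non-negative on purpose: with a zero branch, the outer ReLU only preserves values that are already >= 0, which is what activations coming out of an earlier ReLU look like.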

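Why the +1 saves the gradient: Differentiate the block (setting the outer ReLU aside): d(x + F(x))/dx = 1 + F'(x). Backprop multiplies one such factor per block. Even when F'(x) is tiny, the factor stays near 1 instead of near 0, so the product doesn’t collapse:

    plain:    dL/dx1 = dL/dxN * product of F'(z)        -> 0      (each F' <= ~0.25 for sigmoid)
    residual: dL/dx1 = dL/dxN * product of (1 + F'(z))  -> ~O(1)  (while the F' terms stay small)
                                            ^ the identity path never vanishes

Here dL/dxN is the gradient arriving at the last block. Multiplied out, the residual product is 1 plus terms that contain F', and that leading 1 is the identity path: a gradient highway straight back to the earliest layers.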
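
The product argument can be checked numerically. The sketch below is a toy under stated assumptions: scalar layers, a sigmoid branch, 50 layers with hypothetical alternating weights of +0.5 and -0.5, and no outer ReLU (x stays positive in the residual chain here, so that ReLU would be a no-op anyway). `chain_gradient` is an illustrative helper, not part of any library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def chain_gradient(weights, x0, residual):
    """Return d(x_N)/d(x_0) for a chain of scalar layers.

    Plain layer:    x_next = sigmoid(w * x)
    Residual layer: x_next = x + sigmoid(w * x)
    """
    x, grad = x0, 1.0
    for w in weights:
        s = sigmoid(w * x)
        f_prime = w * s * (1.0 - s)  # derivative of sigmoid(w * x) w.r.t. x
        if residual:
            grad *= 1.0 + f_prime    # the identity path contributes the "1"
            x = x + s
        else:
            grad *= f_prime          # only the small factor survives
            x = s
    return grad


# Hypothetical weights: 50 layers, alternating +0.5 / -0.5.
weights = [0.5 if k % 2 == 0 else -0.5 for k in range(50)]

plain_grad = chain_gradient(weights, x0=0.5, residual=False)
residual_grad = chain_gradient(weights, x0=0.5, residual=True)

print(f"plain:    |dx_N/dx_0| = {abs(plain_grad):.3e}")
print(f"residual: |dx_N/dx_0| = {abs(residual_grad):.3e}")
```

In the plain chain every factor has magnitude at most 0.125 (the weight 0.5 times the sigmoid derivative's ceiling of 0.25), so fifty of them multiply to practically nothing. In the residual chain each factor sits between 0.875 and 1.125, and the product stays on the order of 1. With larger weights the residual product could grow instead, so the skip connection guards against vanishing rather than guaranteeing a perfectly scaled gradient.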

Projection shortcuts: When a block changes the feature dimensions (say, a conv that halves the spatial size and doubles the channels), x and F(x) no longer have the same shape, so you can’t add them. Put a 1×1 conv on the skip, with a stride that matches the downsampling, to project x into the new shape first: this is the “projection shortcut” from the paper. Most shortcuts are plain identity; only dimension-changing ones need this.
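
A shape-level sketch of that projection, again in plain Python on nested lists laid out as [channels][height][width]. Everything named here (`conv1x1`, `w_proj`, the all-zero stand-in for F(x)) is hypothetical and chosen only to show the shapes lining up; in a framework this would typically be a single 1×1 convolution layer with stride 2.

```python
def conv1x1(x, weights, stride):
    """1x1 convolution on a [channels][height][width] nested list.

    `weights[o][c]` mixes input channel c into output channel o; `stride`
    subsamples the spatial grid. No bias, for brevity.
    """
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [
                sum(w_oc * x[c][i][j] for c, w_oc in enumerate(row))
                for j in range(0, width, stride)
            ]
            for i in range(0, height, stride)
        ]
        for row in weights
    ]


def shape(t):
    return (len(t), len(t[0]), len(t[0][0]))


def add(a, b):
    """Element-wise sum of two tensors of identical shape."""
    assert shape(a) == shape(b), "shortcut and branch shapes must match"
    return [
        [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
        for ca, cb in zip(a, b)
    ]


# Input: 2 channels, 4x4. The block halves the spatial size and doubles channels.
x = [[[float(c + i + j) for j in range(4)] for i in range(4)] for c in range(2)]

# Stand-in for the residual branch output F(x): 4 channels, 2x2 (all zeros here).
fx = [[[0.0] * 2 for _ in range(2)] for _ in range(4)]

# Hypothetical projection weights (4 output channels x 2 input channels).
w_proj = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]

shortcut = conv1x1(x, w_proj, stride=2)  # 2x4x4 -> 4x2x2
y = add(shortcut, fx)
print(shape(x), "->", shape(y))
```

Calling `add(x, fx)` directly would trip the shape assertion, which is exactly the mismatch the projection exists to resolve.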

The impact: With residual blocks, the 2015 ResNet paper trained 152-layer networks, about 8× deeper than the VGG nets that came before, and won the ILSVRC 2015 ImageNet classification task. Deeper finally meant better again. And skip connections are now everywhere: ResNets, U-Nets, and every Transformer block (x + Sublayer(x)). The same +1 quietly keeps gradients healthy inside modern LLMs.
