Why Decade-Old Residual Connections Still Power All of AI (And Why That’s a Problem)


For nearly a decade, this part of neural networks barely changed. DeepSeek is trying to reinvent it.

1. Introduction

Over the past decade, deep learning has grown enormously, both in the compute capacity of the hardware and in the ingenuity of the architectures that use it. Look closer, though, and a few key parts of the underlying architecture have stayed the same. We’ve seen a massive shift from convolutional networks to the Transformer architectures that power today’s large language models, but the way these networks route information from one layer to the next has hardly changed.

Recently, researchers at DeepSeek-AI released a paper titled “mHC: Manifold-Constrained Hyper-Connections” (Xie et al., 2025b), which proposes a redesign of this routing system. To appreciate their solution, let’s look at how signal propagation has evolved over the past few generations of models, and why the current methods are hitting a wall.

2. The Backbone: Standard Residual Connections

To understand the specific problem the authors are trying to solve, we need to talk about where it all started: the standard Residual Connection (He et al., 2015). Introduced in 2015 with ResNets, the residual connection is arguably one of the most important architectural design choices in deep learning, and it shows up in nearly every modern AI model.

Mathematically, it looks like this:

$x_{l+1} = x_l + F(x_l)$

where $x_{l+1}$ is the layer’s output activation, $x_l$ is its input, and $F(\cdot)$ is the transformation applied by the layer.

It simply means that the output of a layer is the sum of its transformation $F(x_l)$ and the input it originally received. The key component here is the bare $x_l$ term in the residual stream, which we call the identity mapping. It matters because it acts as an uninterrupted pathway for the gradient signal to flow through the entire network from start to finish. This property helps keep gradients from vanishing or exploding during training, and it is what lets us train models with hundreds of layers while still ensuring that each layer learns and updates effectively.
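To make the gradient-pathway argument concrete, here is a minimal NumPy sketch. It assumes a toy layer $F(x) = \tanh(Wx)$ with small random weights; the layer form, width, depth, weight scale, and the helper names `residual_block` and `stack_jacobian_norm` are all illustrative choices, not taken from the papers. It compares the end-to-end Jacobian of a 50-layer stack with and without the $x_l$ term.

```python
import numpy as np

def residual_block(x, weight):
    """One residual layer: x_{l+1} = x_l + F(x_l), with toy F(x) = tanh(W x)."""
    return x + np.tanh(weight @ x)

def stack_jacobian_norm(depth, width, scale, use_residual, seed=0):
    """Largest singular value of d(x_depth) / d(x_0) for a stack of toy layers."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    jac = np.eye(width)
    for _ in range(depth):
        w = scale * rng.standard_normal((width, width)) / np.sqrt(width)
        pre = w @ x
        out = np.tanh(pre)
        # Jacobian of F(x) = tanh(W x) is diag(1 - tanh^2(W x)) @ W.
        jac_f = (1.0 - out ** 2)[:, None] * w
        # The identity term is what the skip connection adds to each layer's Jacobian.
        layer_jac = jac_f + np.eye(width) if use_residual else jac_f
        jac = layer_jac @ jac  # chain rule across layers
        x = x + out if use_residual else out
    return np.linalg.norm(jac, 2)

plain = stack_jacobian_norm(depth=50, width=16, scale=0.1, use_residual=False)
skip = stack_jacobian_norm(depth=50, width=16, scale=0.1, use_residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {skip:.3e}")
```

In this toy setting, the plain stack’s Jacobian collapses toward zero because fifty small factors are multiplied together, while the residual stack keeps a usable signal because every factor has the form $I + \partial F / \partial x_l$.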

2.1 The Problem with Standard Residual Connections

But as models have grown increasingly massive, we’ve started to hit the limits of this straightforward approach. In a standard Transformer model, we can picture the residual stream as having a fixed width, which we’ll call dimension $C$. Every piece of context, memory, and feature representation has to be crammed into this single $C$-dimensional vector as it moves up the network. As successive layers make the information more abstract and expressive, the $x_l$ term in the residual stream becomes the information bottleneck.

2.2 The Improvement: Hyper-Connections (HC)

To address this limitation, researchers at ByteDance introduced an alternative to the vanilla residual stream, known as Hyper-Connections (Zhu et al., 2024). If the normal residual stream is too “thin”, HC widens it. Instead of relying on a single stream of width $C$, the idea is to expand the residual stream by an expansion factor $n$. You end up with a wider vector composed of $n$ parallel streams, for a total width of $n \times C$.

But since the model’s actual computational layers, like the Attention and MLP blocks, still expect a standard $C$-dimensional input, HC introduces a set of learnable weights to convert between the wide and narrow streams:

  • A Pre-Mapping Matrix: This reads from the wide stream and condenses it down to size $C$.
  • A Post-Mapping Matrix: This takes the layer’s narrow output and expands it back into the wide stream.
  • A Residual Mapping Matrix: This sits directly on the residual pathway, and its purpose is to mix information across the $n$ parallel streams as the signal moves forward.
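The three mappings above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors’ implementation: the wide stream is stored as an $(n, C)$ array, the pre- and post-mappings are plain length-$n$ weight vectors, the residual mapping is an $n \times n$ matrix, and all three are fixed rather than computed from the input. The function name `hyper_connection_step` and its parameter names are hypothetical.

```python
import numpy as np

def hyper_connection_step(streams, layer_fn, h_pre, h_post, h_res):
    """One HC-style update on a widened residual stream (illustrative only).

    streams:  (n, C) array holding n parallel residual streams of width C
    layer_fn: the narrow layer (e.g. attention or MLP), maps (C,) -> (C,)
    h_pre:    (n,)   weights that condense the n streams into one C-dim input
    h_post:   (n,)   weights that spread the C-dim output back over n streams
    h_res:    (n, n) matrix that mixes the streams on the residual pathway
    """
    layer_input = h_pre @ streams          # pre-mapping: (n, C) -> (C,)
    layer_output = layer_fn(layer_input)   # the layer itself stays C-wide
    mixed = h_res @ streams                # residual mapping: mix n streams
    return mixed + np.outer(h_post, layer_output)  # post-mapping: (C,) -> (n, C)

n, C = 4, 8
rng = np.random.default_rng(0)
streams = rng.standard_normal((n, C))

# With one-hot pre/post weights and an identity residual matrix, the update
# reduces to an ordinary residual connection on stream 0.
e0 = np.zeros(n)
e0[0] = 1.0
out = hyper_connection_step(streams, np.tanh, e0, e0, np.eye(n))
print(np.allclose(out[0], streams[0] + np.tanh(streams[0])))  # True
```

The special case at the end is a useful sanity check: a standard residual connection is what you get back when the residual mapping is the identity and only one stream is read from and written to.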

2.3 The Flaws in Hyper-Connections

In practice, however, while HC looks great on paper, it introduces a couple of fatal flaws when you try to scale it up to the size of current LLMs:

  • Mathematical Instability: The Residual Mapping matrix, although expressive, destroys the crucial identity mapping property. Because its entries are learned without constraint, it no longer passes the original signal through unchanged. A tiny feature scale-up in one layer compounds exponentially when multiplied across fifty layers. DeepSeek found that the signal could be amplified by a staggering factor of 3,000, causing wildly erratic gradients and massive spikes in the training loss.
  • The Hardware Bottleneck: Widening the stream by a factor of $n$ forces the memory hardware to read and write significantly more data at every single step. Since memory access, not the actual computation, is often the biggest bottleneck in modern AI training, this extra overhead tanks training throughput and drives up GPU memory usage.
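The compounding effect in the first bullet is easy to reproduce numerically. The sketch below assumes the simplest possible case, where every layer’s residual mapping is a scaled identity matrix; the 5% per-layer gain, the stream count, and the function name `composite_gain` are illustrative assumptions, and this is not the setup DeepSeek measured.

```python
import numpy as np

def composite_gain(residual_maps):
    """Largest singular value of the product of per-layer residual matrices."""
    n = residual_maps[0].shape[0]
    total = np.eye(n)
    for h_res in residual_maps:
        total = h_res @ total  # the signal passes through each mapping in turn
    return np.linalg.norm(total, 2)

n, depth = 4, 50

# Identity mappings (a standard residual stream) conserve the signal exactly.
identity_maps = [np.eye(n) for _ in range(depth)]
print(composite_gain(identity_maps))  # 1.0

# A 5% gain per layer, harmless in isolation, compounds as 1.05 ** 50.
scaled_maps = [1.05 * np.eye(n) for _ in range(depth)]
print(composite_gain(scaled_maps))    # approximately 11.47

# Pure arithmetic: per-layer gain that would reach 3,000x over 50 layers.
print(3000 ** (1 / depth))            # roughly 1.17
```

The point is that no single layer needs to misbehave badly: a modest, consistent gain per layer is enough for the composite mapping to blow up with depth.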