Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

In the first part of this series, “Profiling in PyTorch”, we used torch.add(torch.matmul(x, w), b) to learn how to read PyTorch profiler traces. Along the way we also covered the CPU dispatch chain, launch overhead, the difference between an overhead-bound and a compute-bound regime, and some internals of torch.compile.

In the second iteration (this blog post), we climb one rung up the ladder. We replace the hand-written matmul-add pair with an nn.Linear (with bias=True), a building block that nearly every deep learning model uses. We then stack three of them (a choice specific to our example), with an activation in between, to form a Multilayer Perceptron (MLP) block.

The scripts for this blog post live here: 02_linear.py, 03_simple_mlp.py, and 03_kernels_mlp.py. As before, it helps to open them in a separate tab and walk through the code as you read. We run the scripts on an NVIDIA A100-SXM4-80GB GPU. It is easy to set up a GPU on the Hugging Face infrastructure and experiment with the scripts using Dev Mode with Spaces. You can also run the scripts with the Hugging Face Jobs pipeline.

Before we begin, a quick recap of two ideas we will lean on repeatedly. First, a GPU kernel is a program that runs in parallel on many threads of the GPU. Second, the CPU schedules and launches these kernels, and most of the PyTorch overhead you see in a profiler trace is this scheduling work.

From matmul-add to Linear

nn.Linear is a module wrapper around the same matrix multiplication and addition we already profiled in Part 1. The only difference is that it owns its weight and bias as parameters and exposes the forward method that PyTorch users are familiar with.

# bias=True makes the layer match the multiplication and addition
# operations we profiled in Part 1 of the series
linear_layer = nn.Linear(in_dim, out_dim, bias=True)
y = linear_layer(x)

The operation at hand can be written as y = x @ w.T + b, where x is the input, w is the weight, and b is the bias. Let’s run 02_linear.py and check the profile.
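Before looking at the trace, the formula can be checked directly against the module. The following is a minimal sketch, not part of the post's scripts: the sizes are hypothetical and it runs on the CPU so that it works without a GPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only to keep the example small.
batch, in_dim, out_dim = 4, 8, 3

linear_layer = nn.Linear(in_dim, out_dim, bias=True)
x = torch.randn(batch, in_dim)

with torch.no_grad():
    y_module = linear_layer(x)
    # weight is stored with shape (out_dim, in_dim), hence the transpose.
    y_manual = x @ linear_layer.weight.T + linear_layer.bias

max_abs_diff = (y_module - y_manual).abs().max().item()
print(f"output shape: {tuple(y_module.shape)}, max abs diff: {max_abs_diff:.2e}")
```

The two results should agree up to floating-point rounding, which is why the sketch reports a maximum difference rather than testing for exact equality.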

trace-util is a utility that syncs your traces to a Hugging Face bucket and then prints the Perfetto URLs in your terminal.

Figure 1 shows the profiler trace of a forward call of the linear layer. We trace it with the same kind of schedule as in the previous traces: wait=1, warmup=1 and active=3. This is why we see three Profile Steps in the CPU and GPU lanes.
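To make the schedule concrete, here is a small sketch of how wait=1, warmup=1 and active=3 map step indices to profiler actions. It is an illustration under assumptions rather than the code in 02_linear.py: the workload is a hypothetical stand-in and only CPU activity is recorded so that it runs anywhere.

```python
import torch
from torch.profiler import ProfilerAction, ProfilerActivity, profile, schedule

# Same schedule values as in the post: skip 1 step, warm up for 1, record 3.
prof_schedule = schedule(wait=1, warmup=1, active=3)

# A schedule is a callable that maps a step index to a profiler action.
actions = [prof_schedule(step) for step in range(5)]
for step, action in enumerate(actions):
    print(step, action.name)

# Hypothetical stand-in workload, kept on the CPU for portability.
layer = torch.nn.Linear(64, 64)
x = torch.randn(32, 64)

traces_ready = []  # the callback appends here when an active window ends


def on_trace_ready(prof):
    traces_ready.append(True)


with profile(
    activities=[ProfilerActivity.CPU],
    schedule=prof_schedule,
    on_trace_ready=on_trace_ready,
) as prof:
    for _ in range(5):  # wait + warmup + active = 5 steps
        with torch.no_grad():
            layer(x)
        prof.step()  # tells the profiler that one step has finished
```

Only the three steps in the active window are recorded, which matches the three Profile Steps visible in the trace.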

What is the transpose doing?

If we zoom into the profiler trace, as we do in Figure 2, we notice an aten::t (transpose) op before the aten::addmm (multiplication and addition) op. This tells us that nn.Linear transposes the weight parameter and then multiplies it with the input, which is why the aten::t op appears.

An important thing to notice is that aten::t does not copy or reorganize data: it only rewrites tensor metadata (shape and stride) on the CPU to represent the transposed matrix. It does not launch a kernel on the GPU. You can verify this in two ways: by looking at the GPU lane in the trace, or by checking the aten::t row in the profiler table and the time it took on CUDA.
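The metadata-only behaviour can also be seen without a profiler. The sketch below uses a hypothetical weight shape and runs on the CPU; it shows that the transposed tensor swaps shape and stride while pointing at the same memory.

```python
import torch

# Hypothetical weight of shape (out_dim, in_dim) = (3, 8).
w = torch.randn(3, 8)
w_t = w.t()  # dispatches aten::t

print("shape :", tuple(w.shape), "->", tuple(w_t.shape))
print("stride:", w.stride(), "->", w_t.stride())

# Same underlying storage: no element was copied or moved.
shares_memory = w_t.data_ptr() == w.data_ptr()
print("shares memory:", shares_memory)

# Writing through the view is visible in the original tensor.
w_t[0, 1] = 42.0
print("w[1, 0] after write:", w[1, 0].item())
```

Because the transposed tensor is a view, it is no longer contiguous; it is the consumer of the view (here the GEMM) that deals with the swapped strides.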

Why are there no separate mul and add kernels?

There is no aten::add (the bias addition) in the dispatch chain of the linear layer, as seen in Figure 3. This is because the bias addition has been folded into the matrix multiplication kernel, using what is called an epilogue.

An epilogue is a small computation that a GEMM (GEneral Matrix Multiply) kernel does at the very end, just before it writes its result back to HBM (High Bandwidth Memory, the GPU’s main memory). Adding a bias, applying an activation, or scaling by a constant are all classic epilogues. The point of an epilogue is to avoid a second round trip to HBM, since memory traffic is what makes such a small operation expensive.
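A back-of-the-envelope calculation shows how much traffic a standalone bias-add kernel would generate. This is an idealized sketch: the function name and the shape are hypothetical, caches are ignored, and every element is counted exactly once.

```python
def standalone_bias_add_traffic(m: int, n: int, bytes_per_element: int) -> int:
    """HBM bytes moved by a separate bias-add kernel on an (m, n) GEMM output.

    Idealized model: no caching, every element is read or written once.
    """
    read_matmul_result = m * n   # load the GEMM output again
    read_bias = n                # load the bias vector
    write_final_result = m * n   # store the biased output
    total_elements = read_matmul_result + read_bias + write_final_result
    return total_elements * bytes_per_element


# Hypothetical shape: 4096 rows, 4096 output features, fp32 (4 bytes each).
traffic = standalone_bias_add_traffic(4096, 4096, 4)
print(f"standalone bias-add traffic: {traffic / 2**20:.1f} MiB")
```

With an epilogue, the GEMM still has to read the bias, but the extra read and write of the full output disappear, and that full-output round trip is the dominant term in the model above.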

nn.Linear calls torch.nn.functional.linear, which, in turn, calls aten::linear. aten::linear looks at the inputs, notices that a bias was passed, and dispatches aten::addmm(bias, x, weight.t()) instead of doing a matmul and an add separately. addmm computes out = x @ weight.T + bias.
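The dispatch described above can be reproduced by hand. The sketch below is an illustration with hypothetical sizes, run on the CPU: it compares torch.addmm with torch.nn.functional.linear and then lists the ATen ops the profiler records for one call. We would expect aten::linear, aten::t and aten::addmm in that list and no aten::add, but the exact op names may differ across PyTorch versions and backends, so treat the printed list as something to inspect rather than a guarantee.

```python
import torch
import torch.nn.functional as F
from torch.profiler import ProfilerActivity, profile

torch.manual_seed(0)

# Hypothetical sizes; x is 2-D, as in the post.
x = torch.randn(16, 32)
weight = torch.randn(8, 32)  # (out_dim, in_dim), the layout nn.Linear stores
bias = torch.randn(8)

# addmm(input, mat1, mat2) computes input + mat1 @ mat2 with default alpha/beta.
out_addmm = torch.addmm(bias, x, weight.t())
out_linear = F.linear(x, weight, bias)
print("addmm matches F.linear:", torch.allclose(out_addmm, out_linear, atol=1e-4))

# Record the ATen ops for a single F.linear call (CPU activity only here).
with profile(activities=[ProfilerActivity.CPU]) as prof:
    F.linear(x, weight, bias)

op_names = sorted({event.key for event in prof.key_averages()})
print(op_names)
```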

The cuBLAS GEMM kernel that runs on the GPU has a bias-add variant built in, and that’s the kernel aten::addmm picks. The add never appears as a separate kernel because it is part of the matmul kernel’s writeback, which is exactly what an epilogue is.

This is the moment to notice something subtle. The kernel you saw in Part 1 under --compile (addmm) is the kernel that eager nn.Linear already uses. There is nothing left for torch.compile to fuse here, which is the next thing we will verify.

Can --compile help a single Linear?

Let’s compile the forward call and look at the profiler trace.

If you compare the eager and compiled traces for a single nn.Linear’s forward, you will find:

  1. The same cuBLAS GEMM kernel on the GPU.
  2. The same aten::addmm op on the CPU.
  3. A few extra rows on the CPU lane that are unique to compile.

This is worth internalizing. A common reflex is to reach for torch.compile whenever a model feels slow. For a single GEMM-with-bias, compile has very little to do. This is not a bug: compile needs more than one operation before it has anything to fuse.