Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler


What you cannot profile, you cannot optimize. Whether you are trying to squeeze more tokens per second out of a Large Language Model (LLM), shave milliseconds off inference, or just understand why your training loop runs slower than the spec sheet promises, the path eventually runs through profiling.

The catch is that profiling has a steep on-ramp. The traces are dense walls of colored rectangles. The events carry intimidating names. Most tutorials assume you can already read them. So even when we know we should be profiling, opening a trace can feel like a chore best left for later (or for someone else). This post, and the series it kicks off, is our attempt to lower that on-ramp.

This is the opening post of Profiling in PyTorch, a series where we slowly build the skill of reading profiler traces and use it to drive optimization. The plan:

Part 1 (this post): start with the simplest possible operation, a matrix multiplication followed by a bias add, and learn how to read what the profiler hands back.
Part 2: scale up to nn.Linear and a small MLP, use the traces to motivate optimizations, and peek at the kernels underneath.
Part 3: put it all together on Large Language Models with transformers.

We document the journey from a beginner’s point of view. There are no prerequisites apart from basic PyTorch. Treat this as a leisurely read with some “Aha!” moments. The structure of the post is intentionally question-led: we open a trace, ask “wait, why is that happening?”, and chase the answer until something clicks.

By the end you should know:

how to set up torch.profiler and what it actually hands back,
how to read the profiler table and the trace (CPU lane, GPU lane, and the suspicious gaps in between),
the chain of events from a Python call all the way down to a CUDA kernel,
what changes (and, more interestingly, what does not change) when you slap torch.compile on top.

Before we begin, two definitions that will make everything below read better:

A GPU kernel is a program that runs in parallel on many threads of the GPU. The CPU schedules and launches these kernels. You don’t usually have to write GPU kernels yourself; when you use a PyTorch operation, it is automatically translated to one or more kernels that do the job on the GPU.

With those two ideas in your back pocket, let’s start asking questions. Here is the entire script that we use for the post: 01_matmul_add.py. We recommend opening this script in a separate tab and walking through the code step by step. We ran the scripts on an NVIDIA A100-SXM4-80GB GPU.

The matrix multiplication and addition operation


As Dr. Sara Hooker aptly quipped, just as we are primarily made up of water, Deep Neural Networks are primarily made up of matrix multiplies. As fundamental as they are, it would be a shame to start our profiling journey with anything else.

import torch

def fn(x, w, b):
    return torch.add(torch.matmul(x, w), b)

The matrix addition along with the matrix multiplication mimics how weights and biases interact in a neuron. This addition (pun intended) also paves the way for compilation later in the post.
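To make the shapes concrete, here is a minimal sketch of calling fn on square inputs. The size of 64 and the bf16 dtype are inferred from the run command and trace file names used later in this post. The random inputs, the 1-D bias, and the fallback to CPU are assumptions for illustration and are not necessarily what 01_matmul_add.py does.

```python
import torch

def fn(x, w, b):
    return torch.add(torch.matmul(x, w), b)

# Assumption: fall back to CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
size = 64

x = torch.randn(size, size, device=device, dtype=torch.bfloat16)  # inputs
w = torch.randn(size, size, device=device, dtype=torch.bfloat16)  # weights
b = torch.randn(size, device=device, dtype=torch.bfloat16)        # bias, broadcast over rows

out = fn(x, w, b)
print(out.shape, out.dtype)
```

The bias is a vector here, so torch.add broadcasts it across every row of the matmul result, which is how a bias is applied in a linear layer.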

To profile, we will use the torch.profiler module. The steps involved are:

Have the code to profile ready (here def fn, which wraps the matrix multiplication and matrix addition).
Annotate the algorithm. This is optional, but we recommend it: record_function labels the region as matmul_add, which makes it easy to find in the traces (as we note later).

def step():
    with torch.profiler.record_function("matmul_add"):
        return fn(x, w, b)

Wrap the code with the torch.profiler.profile context manager:

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU, # the cpu activities
        torch.profiler.ProfilerActivity.CUDA, # the gpu activities
    ],
) as prof:
    # it is recommended to run events multiple times to warm up the GPUs
    for _ in range(5):
        step()
    prof.step()

Export the profile:

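# the profiler table: table() returns a string, so print it or write it to a file
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))

# the profiler trace: a JSON file that can be opened in a trace viewer
prof.export_chrome_trace(trace_path)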
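Put together, the steps above can be sketched as one self-contained script. This is not the 01_matmul_add.py script itself: the input tensors, the temporary output path, and the fallback to CPU-only profiling when no GPU is present are assumptions made so that the sketch runs anywhere. It sorts by CPU time for the same reason.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile, record_function

def fn(x, w, b):
    return torch.add(torch.matmul(x, w), b)

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
x = torch.randn(64, 64, device=device)
w = torch.randn(64, 64, device=device)
b = torch.randn(64, device=device)

def step():
    # Label this region so it is easy to find in the table and the trace.
    with record_function("matmul_add"):
        return fn(x, w, b)

# Only request CUDA activity when a GPU is actually available.
activities = [ProfilerActivity.CPU]
if use_cuda:
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for _ in range(5):
        step()
    if use_cuda:
        # Precaution: wait for queued kernels before the profiler stops.
        torch.cuda.synchronize()

# Artifact 1: the profiler table (a plain string).
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=15)
print(table)

# Artifact 2: the profiler trace (hypothetical output location).
trace_path = os.path.join(tempfile.mkdtemp(), "matmul_add_trace.json")
prof.export_chrome_trace(trace_path)
print("trace written to", trace_path)
```

On a GPU machine, switching sort_by to "cuda_time_total" orders the table by time spent on the device, as in the snippet above.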

The profiler exports two distinct artifacts:

The profiler table: provides a statistical summary of the run. It answers “What is taking the most time?”, which helps to find hotspots. A hotspot is an event that takes the most time, acts as a bottleneck in the pipeline, or is triggered many times.
The profiler trace: provides the temporal view of execution. It answers “When and why did an operation happen?” by depicting the activities taking place on the CPU and the GPU. This is helpful when we want to investigate which kernels were launched, any delays in launching them, any overlap between CPU and GPU activities, and so on.
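The trace produced by export_chrome_trace is a JSON file in the Chrome trace event format, so it can also be inspected without a viewer. The sketch below sums event durations per category on a tiny hand-written trace. Everything in sample_trace (event names, categories, timings) is made up for illustration. A real PyTorch trace carries many more fields and events, and its category names may differ between PyTorch versions.

```python
import json
from collections import defaultdict

# Hand-written stand-in for an exported trace (values are illustrative only).
sample_trace = json.loads("""
{"traceEvents": [
  {"ph": "X", "cat": "user_annotation", "name": "matmul_add", "ts": 0, "dur": 50},
  {"ph": "X", "cat": "cpu_op", "name": "aten::matmul", "ts": 2, "dur": 30},
  {"ph": "X", "cat": "cpu_op", "name": "aten::add", "ts": 34, "dur": 12},
  {"ph": "X", "cat": "kernel", "name": "gemm_kernel", "ts": 40, "dur": 5},
  {"ph": "X", "cat": "kernel", "name": "add_kernel", "ts": 60, "dur": 3}
]}
""")

def duration_by_category(trace):
    """Sum the duration (in microseconds) of complete ("X") events per category."""
    totals = defaultdict(int)
    for event in trace["traceEvents"]:
        if event.get("ph") == "X":
            totals[event.get("cat", "unknown")] += event.get("dur", 0)
    return dict(totals)

totals = duration_by_category(sample_trace)
print(totals)
```

To try this on a real file, load it with json.load instead of using sample_trace. Nested events overlap in time, so these sums give a rough per-category picture, not a wall-clock breakdown.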

Let’s see the two in action with our first execution. (Here is the entire 01_matmul_add.py script.) We recommend running this script on a machine with a GPU.

uv run 01_matmul_add.py --size 64

If you run the above script (on a GPU machine), you will find a folder traces/01_matmul_add containing the two artifacts:

64_bf16_cold_eager.json
64_bf16_cold_eager.txt

Figure 1: Profiler table for matmul add on matrices of size 64

The .txt file holds the profiler table. Opening it, as shown in Figure 1, reveals a large table. The first column lists the events that were triggered inside the scope of profile. The remaining columns report the time each event takes on the CPU, the GPU, or any other device specified in activities within torch.profiler.profile. Look at which events take the most time, and ask whether each one should in fact take that long.
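The same numbers can be read programmatically, which is handy once the table gets long. The sketch below profiles on the CPU only so that it runs on any machine, then prints the three events with the largest total CPU time. The tensor sizes are arbitrary. On a GPU machine you would add ProfilerActivity.CUDA and look at the device-time columns instead.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

x, w, b = torch.randn(64, 64), torch.randn(64, 64), torch.randn(64)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        with record_function("matmul_add"):
            torch.add(torch.matmul(x, w), b)

# key_averages() groups events by name; each entry has a call count and timings.
events = sorted(prof.key_averages(), key=lambda e: e.cpu_time_total, reverse=True)
for evt in events[:3]:
    print(f"{evt.key:<20} calls={evt.count:<3} cpu_total={evt.cpu_time_total:.1f}us")
```

A quick sanity check of the kind the text suggests: matmul_add should show five calls, one per loop iteration, and its total time should be at least that of the aten operations nested inside it.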