CSPNet Paper Walkthrough: Just Better, No Tradeoffs
A review of the Cross-Stage Partial Network paper and a from-scratch PyTorch implementation.
How do you make your CNN-based model more lightweight? Just take the smaller version of that model, right? With ResNet, for instance, if ResNet-152 feels too heavy, why not just use ResNet-101? Or in the case of DenseNet, why not go with DenseNet-121 rather than DenseNet-169? Yes, that works, but you would have to sacrifice some accuracy for it. Basically, if you want a lighter model, you should expect accuracy to drop as well.
Now, what if I told you about a model that is lighter than its base version but can still compete on accuracy? Meet CSPNet (Cross Stage Partial Network). You may be surprised that it can effectively reduce computational complexity while maintaining high accuracy, with no tradeoff! In this article we are going to talk about the CSPNet architecture, including how it works and how to implement it from scratch.
A Brief History of CSPNet
CSPNet was first introduced in the paper titled “CSPNet: A New Backbone That Can Enhance Learning Capability of CNN,” written by Wang et al. in November 2019 [1]. It was originally proposed to address the limitations of DenseNet. Despite DenseNet already being computationally cheaper than ResNet, the authors argued that its computation is still expensive. Take a look at the main building block of a DenseNet in Figure 1 below to understand why.
(Figure 1. The main building block of a DenseNet model [2].)
In a DenseNet building block, called a dense block, every convolution layer takes information from all previous layers, causing it to carry a lot of redundant gradient information that makes training inefficient. We can think of it like a student taught the same material by five different teachers. That is actually good, since the student gets multiple perspectives on the topic. At some point, however, it becomes redundant and thus inefficient.
In the case of DenseNet, we can see the deeper layers as students and all the tensors from shallower layers as teachers. In the example above, if we take H₄ as our student, then the x₀, x₁, x₂, and x₃ tensors act as the teachers. You can imagine how that student would get overwhelmed by all that information!
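To make this connectivity concrete, here is a minimal PyTorch sketch of a dense block. The class names and the BN-ReLU-Conv ordering are my own simplification of the DenseNet recipe, not necessarily the exact implementation we will build later in this article:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    # One layer H_i: consumes the concatenation of all previous feature
    # maps and produces `growth_rate` new feature maps.
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(self.relu(self.bn(x)))

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]  # x_0
        for layer in self.layers:
            # Every layer sees all previous outputs: the "many teachers" effect.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```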
Before we get into CSPNet, I actually have a whole separate article specifically talking about DenseNet (reference [3]), which I highly recommend you read if you want the full picture of how this architecture works.
Objectives
The objective of CSPNet is to give the network cheaper computational complexity and a better gradient combination. The reason for the latter is that most gradient information in DenseNet consists of duplicates. It is important to note that CSPNet is not a standalone network; rather, it is a new paradigm that we apply to DenseNet. Now let’s take a look at Figure 2 below to see how CSPNet achieves these objectives.
You can see in the illustration on the left that the number of feature maps gradually increases as we get deeper into the network. If you have read my previous article about DenseNet, this is essentially something we control through the growth rate parameter, i.e., the number of feature maps produced by each convolution layer within a dense block. In fact, this increase in the number of feature maps is exactly what the authors see as the computational bottleneck.
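As a quick sanity check of how the channel count accumulates, here is the DenseBlock sketch from above run with hypothetical numbers (64 input channels, growth rate 32, four layers):

```python
# Inputs to successive layers grow as 64 -> 96 -> 128 -> 160 channels,
# and the final concatenation yields 64 + 4 * 32 = 192 feature maps.
block = DenseBlock(in_channels=64, growth_rate=32, num_layers=4)
out = block(torch.randn(1, 64, 56, 56))
print(out.shape)  # torch.Size([1, 192, 56, 56])
```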
(Figure 2. Left: the original DenseNet building block (same as Figure 1). Right: the CSPNet version of the DenseNet building block, called CSPDenseNet [1].)
By applying the Cross Stage Partial mechanism, we can basically make the computation of a DenseNet cheaper. If we take a look at the illustration on the right, we can see an additional branch coming out of x₀ that goes directly to the so-called Partial Transition Layer. This mechanism gives us at least two advantages, which are in accordance with the objectives I mentioned earlier.
First, we save a lot of computation, since the number of feature maps processed by the dense block is only half of the original. Second, the gradient information becomes more diverse, since we get an additional path of unprocessed feature maps that avoids the redundant gradient information. In short, CSPNet eliminates the computational redundancy of DenseNet (through the skip path) while still preserving its feature-reuse property (through the dense block).
The Detailed CSPNet Architecture
Speaking of the details, the original feature map is first divided into two parts in a channel-wise manner, and each part is processed along a different path. Suppose we have 64 input channels: the first 32 feature maps (part 1) will skip all computations, whereas the remaining 32 (part 2) will be processed by a dense block. Although this splitting step is pretty easy, the merging step is actually not quite trivial. You can see in Figure 3 below that there are several different mechanisms for doing so.
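The split itself is a one-liner in PyTorch, for example with `torch.chunk` along the channel dimension (a minimal sketch):

```python
import torch

x = torch.randn(1, 64, 56, 56)           # x_0 with 64 channels
part1, part2 = torch.chunk(x, 2, dim=1)  # 32 channels each
# part1 skips all computation; part2 goes through the dense block.
print(part1.shape, part2.shape)  # both torch.Size([1, 32, 56, 56])
```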
(Figure 3. Several different ways to perform feature combination in CSPNet [1].)
In the structure referred to as fusion first (c), we concatenate the part 1 tensor with the part 2 tensor that has been processed by the dense block before passing them through the transition layer. Option (c) is therefore pretty straightforward to implement, because the spatial dimensions of the two tensors are exactly the same, allowing us to concatenate them easily.
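Here is a minimal sketch of the fusion-first idea, reusing the DenseBlock from earlier. The `Transition` module follows the usual DenseNet recipe (a 1×1 convolution plus average pooling), and the output channel count is my assumption for illustration:

```python
import torch
import torch.nn as nn

class Transition(nn.Module):
    # DenseNet-style transition: a 1x1 conv to reduce channels,
    # followed by 2x2 average pooling to halve the spatial dimensions.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(x))

class CSPFusionFirst(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        half = in_channels // 2
        dense_out = half + num_layers * growth_rate
        self.dense = DenseBlock(half, growth_rate, num_layers)
        # Concatenate both parts first, then run a single transition.
        # The output channel count (in_channels) is an assumption here.
        self.transition = Transition(half + dense_out, in_channels)

    def forward(self, x):
        part1, part2 = torch.chunk(x, 2, dim=1)
        fused = torch.cat([part1, self.dense(part2)], dim=1)  # same spatial size
        return self.transition(fused)
```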
In my previous article [3], I mentioned that the transition layer of a DenseNet is used to reduce both the spatial dimensions and the number of channels. In fact, this property requires us to rethink how to implement the fusion last (d) structure. This is essentially because the transition layer will cause the part 2 tensor to have a smaller spatial dimension than the part 1 tensor.
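One way to sketch fusion last under that constraint is to downsample the part 1 tensor before concatenation, reusing the DenseBlock and Transition modules from above. Note that the average pooling applied to part 1 below is my workaround to make the shapes match, not a detail confirmed by the text here:

```python
import torch
import torch.nn as nn

class CSPFusionLast(nn.Module):
    # Fusion last: only the dense-block output passes through a transition,
    # and the two parts are concatenated afterwards.
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        half = in_channels // 2
        dense_out = half + num_layers * growth_rate
        self.dense = DenseBlock(half, growth_rate, num_layers)
        self.transition = Transition(dense_out, half)
        # ASSUMPTION: pool part 1 so its spatial size matches the
        # transitioned part 2; the text above does not specify this step.
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        part1, part2 = torch.chunk(x, 2, dim=1)
        part2 = self.transition(self.dense(part2))  # spatially halved
        part1 = self.pool(part1)                    # match part 2's size
        return torch.cat([part1, part2], dim=1)
```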