Every Byte Matters


I have spent a large portion of my career working in Java. In that time, I got used to huge classes. New functionality? Just add a new method and field to the class. The cost of each new field is rarely considered. Performance is usually approached from a classic computer science perspective: asymptotic analysis of the algorithms and data structures in use.

It turns out that even within a single growth class, such as a simple O(N) for-loop, running time can vary dramatically, and a slightly deeper understanding of the underlying hardware explains why. First, let’s understand our current machine by taking a peek at its cache sizes and cache line size.

$ lscpu | grep -i cache
L1d cache: 352 KiB (10 instances)
L1i cache: 640 KiB (10 instances)
L2 cache: 10 MiB (5 instances)
L3 cache: 12 MiB (1 instance)

$ getconf LEVEL1_DCACHE_LINESIZE
64

The instance count reflects how the caches are shared among CPUs. With 10 CPUs, each one has its own L1d cache, each pair shares an L2 cache, and all of them share the single L3 cache. Our cache line size is 64 bytes.
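The same line size can also be read from inside a C program. The sketch below assumes a Linux system with glibc, where `sysconf` accepts the `_SC_LEVEL1_DCACHE_LINESIZE` extension; `cache_line_size` and `FALLBACK_LINE_SIZE` are hypothetical names, and the fallback of 64 is simply the value `getconf` reported above.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define FALLBACK_LINE_SIZE 64 /* assumption: the value getconf reported above */
static_assert((FALLBACK_LINE_SIZE & (FALLBACK_LINE_SIZE - 1)) == 0,
              "cache line sizes are powers of two");

/* Hypothetical helper: L1d cache line size in bytes, or the fallback
   when the platform cannot report it. */
size_t cache_line_size(void) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long reported = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); /* glibc extension */
    if (reported > 0) {
        return (size_t)reported;
    }
#endif
    return FALLBACK_LINE_SIZE;
}
```

Some environments report 0 or -1 for this query, which is why the helper falls back to a constant instead of trusting the result blindly.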

┌─────────────────────────────────────────────┐
│                  64 bytes                   │
│  byte 0   byte 1   byte 2    …    byte 63   │
└─────────────────────────────────────────────┘

When you read a single byte from memory, the hardware loads the entire 64-byte line containing that byte into the cache. The idea is that data access tends to exhibit temporal and spatial locality: data near a recently accessed address is likely to be needed next, and recently accessed data is likely to be needed again soon.
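To make "the line containing that byte" concrete, here is a minimal sketch that maps an address to the start of its cache line by masking off the low bits. It assumes 64-byte lines as on this machine; `line_base` and `same_line` are illustrative names, not part of any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumption: 64-byte lines, as on this machine */
static_assert((CACHE_LINE & (CACHE_LINE - 1)) == 0,
              "the mask trick needs a power-of-two line size");

/* Start address of the cache line that contains addr. */
uintptr_t line_base(uintptr_t addr) {
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}

/* True when both addresses fall within the same cache line. */
bool same_line(uintptr_t a, uintptr_t b) {
    return line_base(a) == line_base(b);
}
```

Addresses 0 through 63 share one line, while address 64 starts the next, so `same_line(63, 64)` is false even though the two bytes are adjacent.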

We can reference Jeff Dean’s famous “Latency numbers every programmer should know”; a quick recap with approximate values for our particular machine follows:

┌────────────────────────────────────────────────────────────────┐
│                            CPU Core                            │
│  ┌───────────┐                                                 │
│  │ Registers │  < 1 ns                                         │
│  └─────┬─────┘                                                 │
│        ▼                                                       │
│  ┌───────────┐                                                 │
│  │ L1d Cache │  ~35 KiB/core       ~4-5 cycles      ~1-2 ns    │
│  │           │  ~560 cache lines                               │
│  └─────┬─────┘                                                 │
│        ▼                                                       │
│  ┌───────────┐                                                 │
│  │ L2 Cache  │  ~2 MiB/core-pair   ~12-15 cycles    ~4-5 ns    │
│  │           │  ~32,000 cache lines                            │
│  └─────┬─────┘                                                 │
│        ▼                                                       │
│  ┌───────────┐                                                 │
│  │ L3 Cache  │  12 MiB shared      ~30-40 cycles    ~10-15 ns  │
│  │           │  ~196,000 cache lines                           │
│  └─────┬─────┘                                                 │
│        ▼                                                       │
│  ┌───────────┐                                                 │
│  │   DRAM    │                     ~100-200 cycles  ~60-100 ns │
│  └───────────┘                                                 │
└────────────────────────────────────────────────────────────────┘

The size shown for each cache is the number returned by lscpu divided by the number of instances; e.g. 352 KiB ÷ 10 instances ≈ 35 KiB. We then determine the number of cache lines by dividing this size by the 64-byte line size; e.g. 35 KiB ÷ 64 bytes = 560 cache lines.
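The arithmetic above can be written as a tiny helper. This is only a sketch of the division described in the text; `lines_per_instance` is a hypothetical name, and it uses integer division, so the L1d result comes out as 563 rather than the rounded ~560 used in the diagram.

```c
#include <assert.h>
#include <stddef.h>

/* Cache lines available in one instance of a cache level:
   (total size / number of instances) / line size. */
size_t lines_per_instance(size_t total_bytes, size_t instances, size_t line_size) {
    assert(instances > 0 && line_size > 0);
    return (total_bytes / instances) / line_size;
}
```

With the lscpu values from this machine, `lines_per_instance(352 * 1024, 10, 64)` gives 563, the L2 figures (10 MiB, 5 instances) give 32768, and the L3 figures (12 MiB, 1 instance) give 196608.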

Why does all of this matter? 🤔 Let’s consider an example where we want to iterate over an array of Monster structs and read the boolean is_alive from each one to filter them. We create our struct, and in this particular example we need 64 bytes to represent a single Monster.

struct Monster {
    uint32_t id;        // 4 bytes
    float x, y, z;      // 12 bytes
    float vx, vy, vz;   // 12 bytes
    int32_t hp;         // 4 bytes
    int32_t attack;     // 4 bytes
    int32_t defense;    // 4 bytes
    uint8_t is_alive;   // 1 byte
    uint8_t team;       // 1 byte
    char name[22];      // 22 bytes
}; // total: 64 bytes
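The 64-byte total depends on how the compiler lays out the fields, so it is worth checking rather than trusting the comments. A minimal sketch using C11 `static_assert` and `offsetof`, assuming a typical ABI where `float` and `int32_t` are 4 bytes with 4-byte alignment:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct Monster {
    uint32_t id;
    float x, y, z;
    float vx, vy, vz;
    int32_t hp;
    int32_t attack;
    int32_t defense;
    uint8_t is_alive;
    uint8_t team;
    char name[22];
};

/* Compilation fails if the layout ever drifts from one cache line. */
static_assert(sizeof(struct Monster) == 64,
              "Monster should fill exactly one 64-byte cache line");
static_assert(offsetof(struct Monster, is_alive) == 40,
              "is_alive follows ten 4-byte fields");
static_assert(offsetof(struct Monster, name) == 42,
              "name starts right after the two 1-byte fields");
```

Because the checks run at compile time, adding a field later without adjusting `name` would break the build instead of silently growing the struct.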

If we had an array of Monsters and iterated over them, each cache line would hold exactly one Monster (assuming the array starts on a 64-byte boundary), yet we would use only the is_alive byte: 1 of every 64 bytes fetched. This layout is often referred to as “Array of Structs”.
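A minimal sketch of the Array of Structs scan described above. `count_alive_aos` is a hypothetical function name; the struct repeats the definition from earlier so the block compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct Monster {
    uint32_t id;
    float x, y, z;
    float vx, vy, vz;
    int32_t hp;
    int32_t attack;
    int32_t defense;
    uint8_t is_alive;
    uint8_t team;
    char name[22];
};

/* Count live monsters. Each iteration steps 64 bytes forward, so every
   is_alive read lands in a different struct and uses one byte of it. */
size_t count_alive_aos(const struct Monster *monsters, size_t n) {
    assert(monsters != NULL || n == 0);
    size_t alive = 0;
    for (size_t i = 0; i < n; i++) {
        alive += (monsters[i].is_alive != 0);
    }
    return alive;
}
```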

If we instead split the data so that each field lives in its own array, we can pack the cache lines much tighter.

// SoA layout
struct Monsters {
    uint32_t *ids;
    float *xs, *ys, *zs;
    float *vxs, *vys, *vzs;
    int32_t *hps;
    int32_t *attacks;
    int32_t *defenses;
    uint8_t *is_alives; // packed contiguously
    uint8_t *teams;
    char (*names)[22];
};
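The struct above only declares pointers; the arrays still have to be allocated. Here is a reduced sketch with just two of the columns, enough to show the allocation and the is_alive scan. `MonsterColumns` and the three functions are hypothetical names chosen for this example, not part of the original layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Reduced version of struct Monsters: two columns are enough here. */
struct MonsterColumns {
    size_t count;
    uint32_t *ids;
    uint8_t *is_alives; /* one byte per monster, packed contiguously */
};

/* Allocate zeroed columns for n monsters; 0 on success, -1 on failure. */
int monster_columns_init(struct MonsterColumns *m, size_t n) {
    assert(m != NULL && n > 0);
    m->count = n;
    m->ids = calloc(n, sizeof *m->ids);
    m->is_alives = calloc(n, sizeof *m->is_alives);
    if (m->ids == NULL || m->is_alives == NULL) {
        free(m->ids);
        free(m->is_alives);
        m->ids = NULL;
        m->is_alives = NULL;
        m->count = 0;
        return -1;
    }
    return 0;
}

void monster_columns_free(struct MonsterColumns *m) {
    free(m->ids);
    free(m->is_alives);
    m->ids = NULL;
    m->is_alives = NULL;
    m->count = 0;
}

/* Count live monsters: consecutive iterations read consecutive bytes,
   so one 64-byte cache line serves 64 monsters. */
size_t count_alive_soa(const struct MonsterColumns *m) {
    size_t alive = 0;
    for (size_t i = 0; i < m->count; i++) {
        alive += (m->is_alives[i] != 0);
    }
    return alive;
}
```

The trade-off is that code touching many fields of one monster now reads from several arrays, so this layout pays off mainly for scans over one or two fields.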

This type of layout is referred to as “Struct of Arrays”. How much of an impact can this have? We can observe up to 30x improvements when the Monster struct is 1 KiB 🤯. The difference is less noticeable when the struct is small because multiple Monster structs can still be fetched within a single cache line.
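A rough way to see why small structs narrow the gap is to count how many cache lines each layout touches while reading a one-byte field. The sketch below is a simplified counting model, not a timing prediction: it assumes 64-byte lines, a line-aligned array, and a field that never straddles two lines, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64u /* assumption: 64-byte lines, as on this machine */

/* Array of Structs: lines touched when reading one byte from each of n structs. */
size_t lines_touched_aos(size_t n, size_t struct_size) {
    assert(struct_size > 0);
    if (struct_size >= CACHE_LINE) {
        return n; /* every struct sits in its own line (or lines) */
    }
    /* Smaller structs share lines, so several reads hit the same line. */
    return (n * struct_size + CACHE_LINE - 1) / CACHE_LINE;
}

/* Struct of Arrays: the n one-byte fields are packed back to back. */
size_t lines_touched_soa(size_t n) {
    return (n + CACHE_LINE - 1) / CACHE_LINE;
}
```

For 1024 monsters, the model gives 1024 lines for a 64-byte struct, 256 lines for a 16-byte struct, and 16 lines for the packed array. Measured speedups also depend on effects this model ignores, such as prefetching and address translation.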

This sequential access pattern is about as cache-friendly as it gets, though. Your CPU’s prefetcher detects that you are walking memory sequentially and fetches the next cache line before you need it, so you rarely have to wait for memory.

What about random access patterns? Not all access patterns are sequential. Hash maps, trees, graph traversal, and pointer-heavy data structures jump to unpredictable locations. The CPU can’t prefetch what it can’t predict. With random access, the CPU needs the entire collection to be present in the cache in order to avoid stalls on memory loads. This means the total size of your collection determines your performance tier.
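The idea of a "performance tier" can be sketched as a comparison of the working set against each cache capacity. The capacities below are the per-instance values derived from the lscpu output earlier and are assumptions specific to this machine; `working_set_tier` and the enum are hypothetical names. A real cache is shared with other data, so treat the boundaries as optimistic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum tier { TIER_L1D, TIER_L2, TIER_L3, TIER_DRAM };

/* Assumed per-instance capacities for the machine in this article. */
#define L1D_BYTES ((size_t)35 * 1024)
#define L2_BYTES  ((size_t)2 * 1024 * 1024)
#define L3_BYTES  ((size_t)12 * 1024 * 1024)

/* Smallest cache level that could hold count elements of elem_size bytes. */
enum tier working_set_tier(size_t count, size_t elem_size) {
    assert(elem_size == 0 || count <= SIZE_MAX / elem_size); /* no overflow */
    size_t bytes = count * elem_size;
    if (bytes <= L1D_BYTES) return TIER_L1D;
    if (bytes <= L2_BYTES)  return TIER_L2;
    if (bytes <= L3_BYTES)  return TIER_L3;
    return TIER_DRAM;
}
```

For example, 512 elements of 64 bytes (32 KiB) land in `TIER_L1D`, while 512 elements of 128 bytes (64 KiB) already land in `TIER_L2`.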

Doubling the struct from 64 B to 128 B doubles the working set for the same number of monsters, pushing the data into slower cache levels. At just 512 monsters, an array of 64 B structs (32 KiB) fits in L1d at ~3 ns, but an array of 128 B structs (64 KiB) has already spilled to L2 at ~11 ns. We can observe this with a pointer-chasing benchmark. We allocate N monster-sized nodes, wire them into a random order, and chase pointers. Each hop lands at an unpredictable address, defeating the CPU’s prefetcher entirely.
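The core of such a benchmark might look like the sketch below. It is not the code behind the numbers above: `Node`, `wire_random_cycle`, and `chase` are hypothetical names, the node is padded to 64 bytes to stand in for a Monster, and the shuffle uses `rand()`, which is adequate for a sketch but biased for very large N. Timing code (for example around `chase`) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A monster-sized node: one pointer plus padding up to 64 bytes. */
struct Node {
    struct Node *next;
    char payload[64 - sizeof(struct Node *)];
};
static_assert(sizeof(struct Node) == 64, "node should match one cache line");

/* Link all n nodes into a single cycle in random order.
   Returns 0 on success, -1 if the scratch array cannot be allocated. */
int wire_random_cycle(struct Node *nodes, size_t n, unsigned seed) {
    assert(nodes != NULL && n > 0);
    size_t *order = malloc(n * sizeof *order);
    if (order == NULL) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        order[i] = i;
    }
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) { /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
    for (size_t i = 0; i < n; i++) { /* visit order becomes the link order */
        nodes[order[i]].next = &nodes[order[(i + 1) % n]];
    }
    free(order);
    return 0;
}

/* Follow hops pointers from start. Each load depends on the previous
   one, so the CPU cannot overlap or predict them. */
struct Node *chase(struct Node *start, size_t hops) {
    struct Node *p = start;
    for (size_t i = 0; i < hops; i++) {
        p = p->next;
    }
    return p;
}
```

Because the links form one cycle through every node, chasing exactly N hops returns to the starting node, which doubles as a cheap correctness check and keeps the compiler from discarding the loop when the result is used.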