Data Access Patterns That Makes Your CPU Really Angry

Data Access Patterns That Makes Your CPU Really Angry

让 CPU 极其“愤怒”的数据访问模式

Given an array of data, what is the slowest way to sum up the integers? Is it adding the numbers from left to right, adding them randomly, or doing something else? In this post, we are going to build a data access pattern from the ground up that sums numbers as slowly as possible by exploiting memory pitfalls.

给定一个数据数组,求和其中整数最慢的方法是什么?是按从左到右的顺序相加,随机相加,还是其他方式?在这篇文章中,我们将通过利用内存陷阱,从零开始构建一种尽可能缓慢地对数字进行求和的数据访问模式。

uint32_t* data = ...; 
// sequential: data[0] + data[1] + data[2] + ... 
// random: data[67] + data[69420] + data[42] + ... 
// the slowest: data[A] + data[B] + data[C] + ...

Spoiler: You can do >30% worse than a randomized access pattern.

剧透:你可以做到比随机访问模式慢 30% 以上。

Some rules: Only consider the time taken to run the accumulation function. The time taken to create positions isn’t included. The accumulation function is fixed as follows, and data is filled randomly and we are only allowed to change the contents of positions:

一些规则: 仅考虑运行累加函数所花费的时间。创建位置数组的时间不计入在内。 累加函数固定如下,数据是随机填充的,我们只允许更改 positions 数组的内容:

constexpr int ELEMENT_COUNT = (1 << 16) * (PAGE_SIZE / sizeof(uint32_t)); // 2^26 
/* data contains integers that we want to sum up 
 * positions is the access pattern that we use to sum up the integers 
 * Overflow is expected, but we don’t really care about the actual sum, do we? */ 
uint32_t accumulator(uint32_t const* data, uint32_t const* positions) { 
    uint32_t total = 0; 
    for (uint32_t i = 0; i < ELEMENT_COUNT; ++i) { 
        uint32_t pos = positions[i]; 
        total += data[pos]; 
    } 
    return total; 
}

Measurements of accumulator durations are based on using rdtsc cycle count.

累加器持续时间的测量基于 rdtsc 时钟周期计数。

Some additional notes: There are 2^26 integers: using 65536 pages, and each page contains 1024 integers. These numbers are chosen simply so it doesn’t take too long on my machine to run. Huge pages are disabled. All measurements are based on my machine:

一些补充说明: 共有 2^26 个整数:使用 65536 个页面,每个页面包含 1024 个整数。选择这些数字仅仅是为了让它在我的机器上运行不会花费太长时间。 大页(Huge pages)已禁用。 所有测量均基于我的机器:

(Machine specs omitted for brevity)

(机器配置参数略)

The full code can be found here, run with g++ -std=c++2a -O3 slowest.cc && taskset -c 3 sudo ./a.out. I highly recommend opening up slowest.cc and running the code.

完整代码可以在这里找到,使用 g++ -std=c++2a -O3 slowest.cc && taskset -c 3 sudo ./a.out 运行。我强烈建议打开 slowest.cc 并运行代码。

Our job is to find the permutation of positions that yields the slowest possible timing. Let’s begin with the simplest access pattern:

我们的任务是找到能产生最慢执行时间的 positions 排列。让我们从最简单的访问模式开始:

void linear(uint32_t const* data, uint32_t* positions) { 
    for (uint32_t i = 0; i < ELEMENT_COUNT; ++i) { 
        positions[i] = i; 
    } 
}

This is likely the fastest permutation, taking 133M (132752394) cycles. This is expected, since CPUs are heavily optimised for sequential accesses.

这很可能是最快的排列,耗时 1.33 亿(132,752,394)个周期。这是预料之中的,因为 CPU 对顺序访问进行了深度优化。

On the other hand, we could randomise the permutation of positions.

另一方面,我们可以随机化 positions 的排列。

void fisher_yates_shuffle(uint32_t const* data, uint32_t* positions) { 
    linear(data, positions); 
    uint32_t remaining = ELEMENT_COUNT; 
    for (uint32_t i = 0; i < ELEMENT_COUNT; ++i) { 
        uint32_t random = rand() % remaining; 
        uint32_t tmp = positions[i]; 
        positions[i] = positions[i + random]; 
        positions[i + random] = tmp; 
        --remaining; 
    } 
}

Now, the CPU cannot predict which data it will access next, so randomised access takes 1.57B (1572108618) cycles, which is over 10x worse than with linear access. Could we do worse? Of course. Let’s build up the worst permutation, starting with a simple regression.

现在,CPU 无法预测接下来要访问哪个数据,因此随机访问耗时 15.7 亿(1,572,108,618)个周期,比线性访问慢了 10 倍以上。我们还能做得更慢吗?当然可以。让我们从一个简单的回归开始,构建最差的排列。

Start by setting positions such that every consecutive element accessed is always separated by a cache line, which is the unit of data that is stored in a cache:

首先设置 positions,使得每次连续访问的元素总是被一个缓存行(Cache Line)隔开,这是缓存中存储数据的基本单位:

void separated_by_a_cacheline(uint32_t const* data, uint32_t* positions) { 
    constexpr int element_count_per_cacheline = CACHELINE_SIZE / sizeof(uint32_t); 
    constexpr int cacheline_count = ELEMENT_COUNT / element_count_per_cacheline; 
    static_assert(ELEMENT_COUNT % element_count_per_cacheline == 0); 
    int current = 0; 
    for (int element_index = 0; element_index < element_count_per_cacheline; ++element_index) { 
        for (int cacheline_index = 0; cacheline_index < cacheline_count; ++cacheline_index) { 
            positions[current] = cacheline_index * element_count_per_cacheline + element_index; 
            ++current; 
        } 
    } 
}

This pattern is terrible because each access uses one 4-byte integer from a 64-byte cache line before moving on. By the time we come back to the same cache line, the useful reuse cache would have been evicted. This culminates in a horrible performance with a cycle count of 719M (718804156), already taking 4x longer than a linear scan.

这种模式非常糟糕,因为每次访问只使用 64 字节缓存行中的一个 4 字节整数,然后就跳转了。当我们回到同一个缓存行时,原本有用的缓存数据早已被驱逐。这导致了极差的性能,周期计数为 7.19 亿(718,804,156),已经比线性扫描慢了 4 倍。

When accessing elements separated by a cache line, the hardware prefetchers can still recognise a simple streaming pattern in data and start fetching future cache lines before the load requests them. However, many Intel hardware data prefetchers do not prefetch across 4 KiB page boundaries, so this help does not carry smoothly from one page to the next. My guess is that crossing a page boundary requires another virtual-to-physical translation, and adjacent virtual pages are not guaranteed to map to adjacent physical pages, so speculative cross-page prefetches are riskier and less generally useful.

当访问被缓存行隔开的元素时,硬件预取器仍然可以识别出数据中简单的流式模式,并在加载请求发出之前就开始获取后续的缓存行。然而,许多 Intel 硬件数据预取器不会跨越 4 KiB 的页面边界进行预取,因此这种优化无法平滑地从一个页面延续到下一个页面。我的猜测是,跨越页面边界需要进行另一次虚拟地址到物理地址的转换,而相邻的虚拟页面并不保证映射到相邻的物理页面,因此跨页面的推测性预取风险更高,且通用性较差。

So instead of only separating our access by a cache line, separate it by a full page instead. Each page is 4096 bytes, and the code looks as follows:

因此,与其只用缓存行来分隔访问,不如用整个页面来分隔。每个页面为 4096 字节,代码如下:

void separated_by_a_page(uint32_t const* data, uint32_t* positions) { 
    constexpr int element_count_per_page = PAGE_SIZE / sizeof(uint32_t); 
    constexpr int page_count = ELEMENT_COUNT / element_count_per_page; 
    static_assert(ELEMENT_COUNT % element_count_per_page == 0); 
    int current = 0; 
    for (int element_index = 0; element_index < element_count_per_page; ++element_index) { 
        for (int page_index = 0; page_index < page_count; ++page_index) { 
            positions[current] = page_index * element_count_per_page + element_index; 
            ++current; 
        } 
    } 
}

There is a significant regression to 1.41B (1411153154) cycles. While the above talks about hindering the HW prefetcher, there is another mem…

性能出现了显著倒退,达到 14.1 亿(1,411,153,154)个周期。虽然上面讨论的是如何阻碍硬件预取器,但还有另一个内存……