Leaving performance on the table

性能的遗珠：别让优化潜力白白流失

I have been working with LLVM at $DAYJOB$, and I have gotten to become familiar with the benefits of optimizing your workloads. I tend to think of optimizing my binaries as thinking about whether I have attached -O3 to my compiler flags; maybe if I’m particularly advanced that day I’ll sprinkle in some -flto (link time optimization) and call it a day. Turns out though that’s leaving lots of performance on the table.

在日常工作中，我一直在使用 LLVM，并逐渐熟悉了优化工作负载所带来的收益。过去，我对于优化二进制文件的理解往往仅限于是否添加了 -O3 编译选项；如果那天我感觉“进阶”了一些，可能会再加个 -flto（链接时优化），然后就觉得大功告成了。但事实证明，这样做其实浪费了大量的性能潜力。

Compilers work under the assumption that every branch is equally taken, unless you are hints like [[likely]] (ref). If we can feed the compilers more information about the likely path that our workloads often take, then they can produce much more performant code. There are two primary ways to optimize a binary: instrumented or statistical. When we instrument our binary, we run our workload with an instrumented binary and capture the exact paths that are executed. We will then optimize the binary perfectly tuned to that workload. If our workloads however are varied, we can collect profiles via perf over a length of time and create an optimized binary based on the statistical occurrence of call graphs. Both approaches have their benefits however let’s start with the instrumented variant first, as it’s a little easier to follow and understand.

编译器通常假设每个分支被执行的概率是相等的，除非你使用了类似 [[likely]] 这样的提示。如果我们能向编译器提供更多关于工作负载实际执行路径的信息，它们就能生成性能高得多的代码。优化二进制文件主要有两种方式：插桩（Instrumented）和统计（Statistical）。当我们对二进制文件进行插桩时，会运行带有插桩代码的程序，并捕获实际执行的路径，从而针对该工作负载进行完美调优。如果工作负载多变，我们也可以通过 perf 在一段时间内收集性能分析数据，并基于调用图的统计出现频率来构建优化后的二进制文件。这两种方法各有利弊，但我们先从插桩方式开始，因为它更容易理解和上手。

Let’s look at a very simple benchmark. We will calculate Fibonacci using SQL in sqlite3. This is an ideal workload because it’s purely CPU-bound and ripe for optimizing.

让我们来看一个非常简单的基准测试。我们将使用 sqlite3 中的 SQL 语句来计算斐波那契数列。这是一个理想的工作负载，因为它完全受限于 CPU，非常适合进行优化。

(Code block omitted for brevity)

We will compile sqlite3 from source by downloading it.

我们将通过下载源码来编译 sqlite3。

(Commands omitted for brevity)

We can compile a “traditional” optimized binary that merely has -O3 and also a version that has LTO enabled since I was also keen to see how much LTO itself adds.

我们可以编译一个仅带有 -O3 的“传统”优化版二进制文件，以及一个启用了 LTO 的版本，因为我也很想看看 LTO 本身能带来多少提升。

(Commands and output omitted for brevity)

Ok, so it looks like our program takes roughly 14-15 seconds to run. Sounds ok? How much better can we do…. 🤔

好了，看起来我们的程序运行大约需要 14-15 秒。听起来还行？但我们还能做得更好吗…… 🤔

Next, we compile our program again but we instrument the binary, which effectively injects counters into the program to count invocations of functions. We get very accurate counts of our calls but the binary itself now runs much slower, which can be a problem if your workload was already very slow. Luckily for us, we are in a time domain (~15 seconds), where that is ok. After we have our instrumented binary, we run our workload again to generate the profile data and rebuild the binary with that data.

接下来，我们再次编译程序，但这次会对二进制文件进行插桩。这实际上是在程序中注入计数器来统计函数的调用次数。我们能得到非常精确的调用计数，但二进制文件本身的运行速度会变慢，如果你的工作负载本来就很慢，这可能会是个问题。幸运的是，我们的运行时间在可控范围内（约 15 秒），所以没问题。在得到插桩后的二进制文件后，我们再次运行工作负载以生成配置文件数据，并利用这些数据重新构建二进制文件。

(Commands omitted for brevity)

The last step will be to optimize with BOLT, which is a post-link optimizer, which requires us to keep relocations so I’ve also added -Wl,-q. When we run our workload with the final optimized binary, we see massive improvement already! 🤯

最后一步是使用 BOLT 进行优化。BOLT 是一个链接后优化器，它要求保留重定位信息，所以我添加了 -Wl,-q。当我们使用最终优化后的二进制文件运行工作负载时，已经看到了巨大的性能提升！🤯

(Commands and output omitted for brevity)

We’ve cut our workload time down to ~10 seconds which is a nearly a 1.5x improvement. Now let’s optimize the final binary with LLVM’s BOLT. BOLT is a post-link optimizer designed for “large applications”. What this means, is that it largely works by shuffling code around the binary to keep code-paths that have high temporal locality near each other (spatial locality). This can have positive impact on performance due to the instruction cache for instance.

我们将工作负载时间缩短到了约 10 秒，这几乎是 1.5 倍的性能提升。现在，让我们用 LLVM 的 BOLT 对最终的二进制文件进行优化。BOLT 是专为“大型应用程序”设计的链接后优化器。这意味着它主要通过在二进制文件中重新排列代码，使具有高时间局部性的代码路径彼此靠近（空间局部性）。例如，这会对指令缓存产生积极的性能影响。

(Commands and output omitted for brevity)

Looks like it was a little faster but not much. That makes sense since sqlite3 itself is a pretty small binary (~6MB), but nonetheless was good to run through. Running a more thorough benchmark with hyperfine we can get a final tally of our results.

看起来速度快了一点，但提升并不明显。这很合理，因为 sqlite3 本身是一个相当小的二进制文件（约 6MB），但无论如何，走一遍这个流程还是很有意义的。通过使用 hyperfine 进行更彻底的基准测试，我们可以得到最终的统计结果。

Build Configuration	Mean Time ± σ	Min … Max	Relative Speed (vs Fastest)
PGO+BOLT	10.536 s ± 0.051 s	10.491 s … 10.631 s	1.00
PGO	10.805 s ± 0.055 s	10.733 s … 10.889 s	1.03 ± 0.01
LTO	14.252 s ± 0.026 s	14.225 s … 14.315 s	1.35 ± 0.01
Basic (no LTO)	14.292 s ± 0.071 s	14.224 s … 14.435 s	1.36 ± 0.01