QBE - Compiler Backend: Version 1.3

QBE compiler backend QBE 1.3 - Release notes qbe-1.3 QBE 1.3 took a while to cook, but it is the most significant release since 1.0 with around 7k new lines of code and 1.5k deleted ones. In addition to the usual bug fixes, QBE gained a new and original IL matching algorithm, new optimizations from Roland Paterson-Jones, Scott Graham added support for the Windows ABI, and I implemented a plan suggested by Michael Forney to have QBE produce position-independent code (as in shared objects). QBE is teamwork, and I am happy to thank all the contributors to this release. In the rest of the notes we will take a closer look at some of the meat of the release.

QBE 编译器后端 1.3 版本发布说明。QBE 1.3 的开发历时较长，但它是自 1.0 版本以来最重要的一次更新，新增了约 7000 行代码，删除了 1500 行代码。除了常规的错误修复外，QBE 还引入了一种全新的原创中间语言（IL）匹配算法，集成了 Roland Paterson-Jones 贡献的新优化，Scott Graham 增加了对 Windows ABI 的支持，我还实现了 Michael Forney 提出的方案，使 QBE 能够生成位置无关代码（如共享对象）。QBE 是团队合作的结晶，我很高兴感谢所有为此次发布做出贡献的人。在接下来的说明中，我们将深入探讨此次发布的核心内容。

Faster

Every once in a while, the QBE fly stings an outstanding programmer. This time, Roland was the victim! Roland suggested we look at the coremark benchmark to give us a concrete but simple playground to optimize. Initial measurements with qbe-1.2 revealed that we were far behind our “70% of gcc -O2” goal, closer to 40%. We decided to address this for the 1.3 release. An early inspection of profiling data revealed that the performance gap boiled down in large part to how two functions are treated: ee_isdigit and crcu8. Interestingly, these functions are not really idiomatic C; for instance, ee_isdigit is typically inlined textually, uses && instead of & and skips the superfluous ternary operator; as for CRC, it is best implemented using a pre-computed table. This observation was a bit disappointing because, aside from QBE’s lack of inlining, it did not point us to a general source of overhead that would apply to other compilation loads. On the other hand, it is expected that CPU-bound code spends a majority of time in compact code sections. Nonetheless, we implemented numerous optimizations (GVN/GCM, loop optimization, if-elimination, CFG simplification, …) that we could try on both coremark and more realistic usages like the Hare test suite. In the end, we decided to keep only a subset of vetted passes and now score more than 63% of the performance of commercial compilers on vanilla coremark. Notably, we excluded inlining from the optimizations set to postpone solving its incompatibility with the streaming per-function compilation model of QBE. Modifying the coremark benchmark to inline the ee_isdigit function and use a simpler branch-free implementation of crcu8 makes QBE reach its 70% goal. The new optimizations should also benefit Hare users: I measured a 33% improvement on the Hare test suite against qbe-1.2 (1.7s vs 2.6s).

更快

偶尔，QBE 的“灵感之虫”会叮咬一位杰出的程序员。这次，Roland 中招了！Roland 建议我们研究 coremark 基准测试，为优化提供一个具体而简单的实验场。使用 qbe-1.2 的初步测量显示，我们距离“达到 gcc -O2 性能的 70%”这一目标还很远，实际仅接近 40%。我们决定在 1.3 版本中解决这个问题。对性能分析数据的早期检查显示，性能差距很大程度上归结于两个函数的处理方式：ee_isdigit 和 crcu8。有趣的是，这些函数并非地道的 C 语言写法；例如，ee_isdigit 通常是文本内联的，使用 && 而非 &，并跳过了多余的三元运算符；至于 CRC，最好使用预计算表来实现。这一观察结果令人有些失望，因为除了 QBE 缺乏内联功能外，它并没有指出适用于其他编译负载的通用性能瓶颈。另一方面，CPU 密集型代码大部分时间花在紧凑的代码段中是预料之中的。尽管如此，我们还是实施了许多优化（GVN/GCM、循环优化、if 消除、CFG 简化等），并在 coremark 和更实际的用例（如 Hare 测试套件）上进行了尝试。最终，我们决定只保留经过验证的优化子集，目前在原生 coremark 上已能达到商业编译器 63% 以上的性能。值得注意的是，我们将内联优化排除在外，以推迟解决其与 QBE 流式逐函数编译模型的不兼容问题。通过修改 coremark 基准测试以内联 ee_isdigit 函数并使用更简单的无分支 crcu8 实现，QBE 可以达到 70% 的目标。这些新优化也将惠及 Hare 用户：我在 Hare 测试套件上测得比 qbe-1.2 提升了 33%（1.7 秒对比 2.6 秒）。

Smarter

Since its early days, QBE used a bottom-up tree-numbering algorithm inspired by Ken Thompson’s Plan9 C compiler (see 5.5. Addressability). The algorithm is fairly generic but has some subtleties in dealing elegantly with associativity and commutativity of arithmetic operators. It has been a long-standing goal of mine to implement a metaprogramming solution to this problem. QBE 1.3 sees the realization of this goal. A new OCaml tool called mgen is used to compile lispy IL patterns into idiomatic C code that matches them. The mgen tool will look for special comment blocks containing IL patterns and inline the matching C code right below these blocks. The generated C code is designed to look idiomatic in qbe and works similarly to the hand-written logic pre 1.3. See the isel.c file for the current use of mgen. In more detail, instruction DAGs are matched by following a numbering approach like in Ken Thompson’s compiler. Then, mgen associates each number with a bitset indicating which of the toplevel user patterns are matched by the current IL node (temporary); the most suitable pattern can then be selected by handwritten logic. Patterns can include variables which can be collected by running a matcher program. These programs are also generated by mgen in a simple bytecode language which the runmatch() function can interpret. I expect that in the future mgen is used to simplify instruction selection in more backends and maybe even to recognize IL patterns such as bit rotations in optimization passes.

更智能

自早期以来，QBE 一直使用受 Ken Thompson 的 Plan9 C 编译器启发（参见 5.5. Addressability）的自底向上树编号算法。该算法相当通用，但在优雅处理算术运算符的结合律和交换律方面存在一些微妙之处。实现一种元编程解决方案来解决这个问题一直是我长期的目标。QBE 1.3 实现了这一目标。一个名为 mgen 的新 OCaml 工具被用于将类似 Lisp 的 IL 模式编译为匹配它们的规范 C 代码。mgen 工具会查找包含 IL 模式的特殊注释块，并将匹配的 C 代码内联到这些块下方。生成的 C 代码旨在符合 QBE 的编码风格，其工作方式与 1.3 版本之前的手写逻辑类似。有关 mgen 的当前用法，请参阅 isel.c 文件。更详细地说，指令 DAG（有向无环图）是通过遵循类似 Ken Thompson 编译器中的编号方法进行匹配的。然后，mgen 将每个编号与一个位集（bitset）关联，指示当前 IL 节点（临时变量）匹配了哪些顶层用户模式；随后可以通过手写逻辑选择最合适的模式。模式可以包含变量，这些变量可以通过运行匹配器程序来收集。这些程序也由 mgen 生成，使用一种简单的字节码语言，runmatch() 函数可以对其进行解释。我预计未来 mgen 将被用于简化更多后端的指令选择，甚至可能用于在优化过程中识别诸如位旋转之类的 IL 模式。

Nicer

For the 1.3 release, QBE also stung Scott Graham. Scott generously upstreamed his implementation of the Windows ABI, originally found in a fun derivative work. The assembly generated by QBE remains AT&T syntax and is best compiled by the mingw assembler, although I have not tried it myself on Windows. Compiling for Windows is now as simple as passing -t amd64_win to QBE. Last but not least, QBE improved its support for position-independent code and is now able to link smoothly with and even produce shared objects on most targets. The main blocker to date has been the lack of support for indirect access to globals (e.g., global offset table on ELF). This is now possible at the level of IL through the support of a new extern “dynamic constant” flag (DYNCONST in the IL spec). For example, to access a variable dlvar from a dynamically-linked library, one would use function w $load() { @start %v =w load extern $dlvar ret %v } And, in case you wonder, we use the oxymoron “dynamic constant” to speak about address symbols (constant through execution) that can only be known at runtime (dynamic) because they are allocated by either the runtime or the dynamic linker rather than by the regular linking phase of compilation. Thanks for reading this far and happy hacking.

更友好

在 1.3 版本中，QBE 也“叮咬”了 Scott Graham。Scott 大方地将其 Windows ABI 实现贡献到了上游，该实现最初源于一个有趣的衍生项目。QBE 生成的汇编代码仍然保持 AT&T 语法，最好使用 mingw 汇编器进行编译，尽管我本人尚未在 Windows 上尝试过。现在，为 Windows 编译只需向 QBE 传递 -t amd64_win 参数即可。最后但同样重要的是，QBE 改进了对位置无关代码的支持，现在能够在大多数目标平台上顺利链接并生成共享对象。迄今为止，主要的障碍是缺乏对全局变量间接访问的支持（例如 ELF 上的全局偏移表）。现在，通过支持一个新的 extern “动态常量”标志（IL 规范中的 DYNCONST），这在 IL 层面已成为可能。例如，要从动态链接库中访问变量 dlvar，可以使用函数：function w $load() { @start %v =w load extern $dlvar ret %v }。顺便提一下，我们使用“动态常量”这个矛盾修辞法来指代那些地址符号（在执行过程中是常量），但它们只能在运行时（动态）获知，因为它们是由运行时或动态链接器分配的，而不是由常规的编译链接阶段分配的。感谢阅读至此，祝编程愉快。