oops, cubic macro

Last week, I happened to glance at the codebase for rust-analyzer. I realized that r-a and Krabby (my very-very-WIP Rust compiler) have a lot in common; r-a re-implements a lot of rustc and has benchmarks and tests to compare it to rustc. While Krabby has different high-level goals, it needs exactly the same infrastructure.

One component that stuck out to me was macro expansion. I’ve previously heard that r-a puts a lot of effort into macro expansion, partly to offer interesting LSP functionality (e.g. “expand this macro”) and partly because macros complicate normal LSP functionality (e.g. “goto definition”, if the definition is generated by a macro).

While r-a sometimes re-uses code from rustc, a quick peek around the codebase revealed that most of r-a’s macro handling code is written from scratch. It also appears to be much simpler; I’m not sure whether that’s because it has been written more concisely (with hindsight from rustc), because it is less concerned with diagnostics, or because it differs from rustc in some edge cases.

Declarative macro expansion (e.g. foo!(a, b)) has two obvious steps: you need to match the input (a, b) against the match arms in foo (“parsing”); then you need to compute the output by filling in meta-variables in the match arm body (“transcribing”).
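For instance (a toy macro of my own, not taken from either codebase), matching binds the meta-variables and transcription substitutes them into the chosen arm’s body:

```rust
// A toy declarative macro illustrating the two steps:
// matching the input against an arm ("parsing"), then filling
// the arm's body with the bound meta-variables ("transcribing").
macro_rules! swap_pair {
    // Parsing: the input must match `<expr> , <expr>`,
    // binding $a and $b.
    ( $a:expr, $b:expr ) => {
        // Transcribing: the meta-variables are substituted here.
        ($b, $a)
    };
}

fn main() {
    // swap_pair!(1, 2) matches the arm with $a = 1 and $b = 2,
    // and expands to (2, 1).
    assert_eq!(swap_pair!(1, 2), (2, 1));
}
```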

r-a’s mbe/expander/matcher.rs, which implements the parsing step, has a top-level comment referencing rustc’s mbe/macro_parser.rs. It appears they both use the same algorithm, although I have not fully understood their implementations. The explanation of the algorithm (from mbe/macro_parser.rs) is really interesting:

Quick intro to how the parser works: A “matcher position” (a.k.a. “position” or “mp”) is a dot in the middle of a matcher, usually written as a ·. For example · a $( a )* a b is one, as is a $( · a )* a b. The parser walks through the input a token at a time, maintaining a list of threads consistent with the current position in the input string: cur_mps. As it processes them, it fills up eof_mps with threads that would be valid if the macro invocation is now over, bb_mps with threads that are waiting on a Rust non-terminal like $e:expr, and next_mps with threads that are waiting on a particular token.

Most of the logic concerns moving the · through the repetitions indicated by Kleene stars. The rules for moving the · without consuming any input are called epsilon transitions. It only advances or calls out to the real Rust parser when no cur_mps threads remain.

Example: Start parsing a a a a b against [· a $( a )* a b].

Remaining input: a a a a b
next: [· a $( a )* a b]

      • Advance over an a.
        Remaining input: a a a b
        cur: [a · $( a )* a b]
        Descend/Skip (first position).
        next: [a $( · a )* a b] [a $( a )* · a b]

      • Advance over an a.
        Remaining input: a a b
        cur: [a $( a · )* a b] [a $( a )* a · b]
        Follow epsilon transition: Finish/Repeat (first position)
        next: [a $( a )* · a b] [a $( · a )* a b] [a $( a )* a · b]

      • Advance over an a. (this looks exactly like the last step)
        Remaining input: a b
        cur: [a $( a · )* a b] [a $( a )* a · b]
        Follow epsilon transition: Finish/Repeat (first position)
        next: [a $( a )* · a b] [a $( · a )* a b] [a $( a )* a · b]

      • Advance over an a. (this looks exactly like the last step)
        Remaining input: b
        cur: [a $( a · )* a b] [a $( a )* a · b]
        Follow epsilon transition: Finish/Repeat (first position)
        next: [a $( a )* · a b] [a $( · a )* a b] [a $( a )* a · b]

      • Advance over a b.
        Remaining input: ''
        eof: [a $( a )* a b ·]
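The loop traced above can be sketched in miniature. This is my own toy version, not r-a’s or rustc’s code: it handles only literal tokens and a single non-nested $( … )* repetition, and uses a set of dot positions where the real matchers keep richer thread state:

```rust
use std::collections::HashSet;

// A flattened matcher: literal tokens plus the delimiters of one
// non-nested `$( ... )*` repetition.
#[derive(Clone, Copy, PartialEq)]
enum Item {
    Tok(char),
    Open,  // `$(`
    Close, // `)*`
}

// Add `pos` and everything reachable from it via epsilon transitions
// (Descend/Skip at `$(`, Finish/Repeat at `)*`) to `out`.
fn epsilon_close(pattern: &[Item], pos: usize, out: &mut HashSet<usize>) {
    if !out.insert(pos) || pos >= pattern.len() {
        return;
    }
    match pattern[pos] {
        Item::Open => {
            // Descend into the repetition...
            epsilon_close(pattern, pos + 1, out);
            // ...or Skip past it.
            let end = pattern[pos..]
                .iter()
                .position(|i| *i == Item::Close)
                .unwrap() + pos;
            epsilon_close(pattern, end + 1, out);
        }
        Item::Close => {
            // Finish the repetition...
            epsilon_close(pattern, pos + 1, out);
            // ...or Repeat it from just after the `$(`.
            let start = pattern[..pos]
                .iter()
                .rposition(|i| *i == Item::Open)
                .unwrap();
            epsilon_close(pattern, start + 1, out);
        }
        Item::Tok(_) => {}
    }
}

// BFS over all dot positions at once: a single pass over the input,
// advancing every surviving "thread" in lockstep.
fn matches(pattern: &[Item], input: &[char]) -> bool {
    let mut cur = HashSet::new();
    epsilon_close(pattern, 0, &mut cur);
    for &c in input {
        let mut next = HashSet::new();
        for &pos in &cur {
            if pos < pattern.len() && pattern[pos] == Item::Tok(c) {
                epsilon_close(pattern, pos + 1, &mut next);
            }
        }
        cur = next;
    }
    // Success iff some thread's dot reached the end of the matcher.
    cur.contains(&pattern.len())
}

fn main() {
    use Item::*;
    // The matcher from the trace: `a $( a )* a b`.
    let pattern = [Tok('a'), Open, Tok('a'), Close, Tok('a'), Tok('b')];
    assert!(matches(&pattern, &['a', 'a', 'a', 'a', 'b']));
    assert!(matches(&pattern, &['a', 'a', 'b'])); // zero repetitions
    assert!(!matches(&pattern, &['a', 'b']));     // too short
}
```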

To me, this looks like a BFS; every possible parse is evaluated simultaneously, over a single pass through the input tokens. My first thought was that a DFS would be better here; it would cover the same ground but require less memory and probably play better with the CPU.

Some prior experience with parsing algorithms (I think memories of packrat parsing) reminded me that a DFS might require caching meta-variables; if the macro requires a Rust expression to be parsed (e.g. $a:expr), we might cache the result for later in the DFS traversal. The cache sounds relatively simple to implement; you’d keep a hash table mapping “parse an expression/statement/etc. at position X” to its result. Every time a meta-variable needs to be matched at some position Y, the hash table would be checked first; on a miss, it would be computed from scratch. The size of the table is bounded quite well: at most one entry per (fragment kind, input position) pair.
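A minimal sketch of such a cache (the types and names here are my own invention, not r-a’s or rustc’s; a real implementation would store the parsed AST rather than just a token count):

```rust
use std::collections::HashMap;

// Which kind of Rust non-terminal we were asked to parse.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum NtKind { Expr, Stmt, Ty }

// A successful parse records how many tokens it consumed;
// `None` means the non-terminal failed to parse at that position.
type ParseResult = Option<usize>;

struct NtCache {
    // Key: (non-terminal kind, token position); value: cached result.
    map: HashMap<(NtKind, usize), ParseResult>,
}

impl NtCache {
    fn new() -> Self {
        NtCache { map: HashMap::new() }
    }

    // Look up "parse `kind` at position `pos`", running the real
    // parser (`parse`) only on a cache miss.
    fn parse_nt(
        &mut self,
        kind: NtKind,
        pos: usize,
        parse: impl FnOnce() -> ParseResult,
    ) -> ParseResult {
        *self.map.entry((kind, pos)).or_insert_with(parse)
    }
}

fn main() {
    let mut cache = NtCache::new();
    // First request at position 3: computed from scratch.
    let first = cache.parse_nt(NtKind::Expr, 3, || Some(2));
    // Second request at the same position: served from the cache;
    // the closure is never run.
    let second = cache.parse_nt(NtKind::Expr, 3, || unreachable!());
    assert_eq!(first, second);
}
```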

But I worried about the soundness of this approach; what if the length of the input mattered too? If one arm of a macro required ($e:expr), and the other ($e:expr + 1) (at the same position), a cached result from the first parse could not be used for the second one. That particular case turns out to be impossible, so the cache stays sound: Rust prohibits meta-variables from being followed by tokens that they could consume. But ambiguity is a problem.
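The restriction shows up directly in what rustc accepts (this macro is my own example, and the quoted error wording is approximate):

```rust
// `expr` fragments may only be followed by `=>`, `,`, or `;` — tokens
// that an expression can never consume — so this arm is accepted:
macro_rules! pair {
    ( $a:expr, $b:expr ) => { ($a, $b) };
}

// An arm like `( $a:expr + 1 )` is rejected when the macro is
// *defined*, with an error along the lines of "`$a:expr` is followed
// by `+`, which is not allowed for `expr` fragments", because the `+`
// could have belonged to the expression itself.

fn main() {
    // The `,` unambiguously ends the first expression: `1 + 2` binds
    // to $a and `3` binds to $b.
    assert_eq!(pair!(1 + 2, 3), (3, 3));
}
```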

In rustc’s matcher code, before the explanation of the algorithm, Earley parsers are mentioned — specifically “We don’t say this parser uses the Earley algorithm, because it’s unnecessarily inaccurate.” But the algorithm is a subset of an Earley parser, and as the Wikipedia page mentions, Earley parsers have a quadratic runtime … for unambiguous grammars. For ambiguous grammars, Earley parsers have cubic runtime. Maybe we can force such behaviour out of rustc…

Around this time, I was talking to jyn, and she linked me to an existing instance of degenerate behaviour in macro expansion. So degenerate cases can occur! Curious whether there were more, I decided to look into ambiguity further. Rust macros can match meta-variables with repetition, e.g. $( $e:expr ),*. This pattern would match a comma-separated sequence of expressions. Unlike (most?) regexes, repetitions in macros are not deterministic; they can match fewer times if needed.

The following example compiles successfully:

macro_rules! foo { ( $( @ )+ ) => {}; }
macro_rules! bar { ( $( @ )+ @ ) => {}; }
foo!(@ @ @ @); // parsed as '(@ @ @ @)'
bar!(@ @ @ @); // parsed as '(@ @ @) @'

But there are cases of ambiguity here. If we put…