Regular expressions that work “everywhere”


The most frustrating aspect of regular expressions is that implementations vary. A feature supported in one tool may be missing entirely from another, or supported with slightly different syntax.

I learned regular expressions in the context of Perl, a maximalist regex environment. This leads to frustration when features I expect to work are missing elsewhere [1]. One way around this is to use Perl analogs of other tools, but this is very non-standard. I want to be able to send colleagues and clients code that works out of the box.

As I mentioned in my post on computational survivalism, I occasionally need to work on computers that I cannot install software on. So a better approach is to identify a subset of regex features that work everywhere. The stricter your definition of “everywhere,” the less this subset includes. The strictest subset would be literals, character classes, and the special characters ., *, ^, and $.

A more relaxed definition of “everywhere” would be the tools you care about most. Currently the tools I most want to use with regular expressions are sed, awk, grep, and Emacs.

Awk as lowest common denominator

If you use the GNU versions of sed, awk, and grep, and pass the -E option to sed and grep, then the list of common features is bigger. The regular expression features of the three tools are similar, and awk’s features are supported in the other two, with one exception: word boundaries. In gawk, \b means a backspace character, so a word boundary is written \y, or \< and \> for the start and end of a word; \B keeps its usual meaning. I have written about awk’s regex features in a separate post.
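As a rough illustration, here is a Python sketch that takes one extended-syntax pattern and builds the corresponding grep, sed, and gawk command lines. The helper names `to_gawk` and `commands_for` and the file name `notes.txt` are made up for the example. It assumes the GNU tools, a pattern with no "/" in it, and that the only adjustment gawk needs is spelling the word boundary \y, since gawk reads \b as a backspace. It only prints the commands; it does not run them.

```python
import re
import shlex


def to_gawk(pattern: str) -> str:
    """Rewrite the word boundary \\b as \\y, leaving other escapes untouched."""
    # Consuming escapes pairwise keeps an escaped backslash followed by a
    # plain "b" from being mistaken for a word boundary.
    return re.sub(
        r"\\(.)",
        lambda m: r"\y" if m.group(1) == "b" else m.group(0),
        pattern,
        flags=re.DOTALL,
    )


def commands_for(pattern: str, path: str) -> dict:
    """Build commands that print the lines of `path` matching `pattern`.

    Assumption: the pattern contains no "/", which sed and gawk use as the
    regex delimiter here.
    """
    return {
        "grep": shlex.join(["grep", "-E", pattern, path]),
        "sed": shlex.join(["sed", "-E", "-n", f"/{pattern}/p", path]),
        "gawk": shlex.join(["gawk", f"/{to_gawk(pattern)}/", path]),
    }


for tool, command in commands_for(r"\b(cat|dog)s?\b", "notes.txt").items():
    print(f"{tool}: {command}")
```

Using `shlex.join` leaves the shell quoting to the standard library, so the backslashes and parentheses reach each tool unchanged.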

Emacs as the oddball

Emacs supports analogs of most of awk’s regex features. However, the characters (, ), {, }, and | require a backslash in front in order to act like their awk counterparts; + and ? work unescaped, as in awk. Also, the analog of \s and \S in awk is \s- and \S- in Emacs. Instead of meaning space or nonspace, \s and \S in Emacs begin a (negated) syntax class, and one of those classes is - for whitespace. But there are many others. For example, \s. stands for a punctuation character and \S. stands for a non-punctuation character.
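The differences above are mechanical enough to sketch in code. The following Python function, with the hypothetical name `ere_to_emacs`, rewrites a small awk-style pattern into Emacs syntax: it adds backslashes to grouping, counting, and alternation characters, removes them where awk uses a backslash to mean a literal, and turns \s and \S into \s- and \S-. It is a sketch rather than a full translator: it copies bracket expressions verbatim, assumes backslashes are not used as escapes inside them, and knows nothing about gawk’s \y.

```python
def ere_to_emacs(pattern: str) -> str:
    """Translate a small awk/ERE-style regex into Emacs regex syntax."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == "[":
            # Copy a bracket expression verbatim; it means the same in both.
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":
                j += 1  # a leading ] is a literal member of the set
            while j < n and pattern[j] != "]":
                if pattern.startswith("[:", j):
                    # Skip over a POSIX class name such as [:space:].
                    end = pattern.find(":]", j + 2)
                    j = end + 2 if end != -1 else n
                else:
                    j += 1
            out.append(pattern[i:j + 1])
            i = j + 1
        elif c == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            if nxt in "sS":
                out.append("\\" + nxt + "-")  # \s -> \s-, \S -> \S-
            elif nxt in "(){}|":
                out.append(nxt)  # a literal in awk; Emacs writes it bare
            else:
                out.append("\\" + nxt)  # \w, \b, \1, \. and so on carry over
            i += 2
        elif c in "(){}|":
            out.append("\\" + c)  # operators need a backslash in Emacs
            i += 1
        else:
            out.append(c)  # ., *, +, ?, ^, $ and literals are unchanged
            i += 1
    return "".join(out)


print(ere_to_emacs(r"(foo|bar){2,3}\s+baz?"))
```

The output is the regex as you would type it at an interactive Emacs prompt. Inside an Emacs Lisp string literal each backslash has to be doubled again.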

What works everywhere

So for my definition of “everywhere,” with the caveats mentioned above, the following features work everywhere. YMMV (your mileage may vary).

  • .
  • ^, $
  • [...], [^...]
  • *
  • \w, \W, \s, \S
  • \1 - \9 (backreferences)
  • \b, \B
  • ?, +, | (alternation)
  • {n,m} (counting matches)
  • (...) (capturing)

One footnote is that gawk supports backreferences in replacement strings (through its gensub function) but not in regular expressions per se.
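One way to keep to this list is to scan a pattern for Perl-style constructs that fall outside it before sending it to someone. The Python sketch below is a heuristic, not a parser. The function name `non_portable_features` and the four checks are my own choices for illustration, and the list is far from exhaustive. It also ignores the contents of bracket expressions, so something like [\d] slips through.

```python
import re

# Heuristic checks for constructs outside the portable list above.
# Illustrative only: many other Perl features are not covered.
NON_PORTABLE = {
    "digit shorthand": re.compile(r"\\[dD]"),        # \d, \D
    "look-around": re.compile(r"\(\?<?[=!]"),        # (?= (?! (?<= (?<!
    "non-capturing group": re.compile(r"\(\?:"),     # (?:
    "lazy quantifier": re.compile(r"[*+?}]\?"),      # *? +? ?? {n,m}?
}


def non_portable_features(pattern: str) -> list:
    """Return the names of the checks above that the pattern trips."""
    # Blank out escaped punctuation (\\, \+, \? ...) and bracket expressions
    # so that their contents are not mistaken for operators.
    cleaned = re.sub(r"\\[^A-Za-z0-9]", "_", pattern)
    cleaned = re.sub(r"\[\^?\]?[^\]]*\]", "_", cleaned)
    return [name for name, rx in NON_PORTABLE.items() if rx.search(cleaned)]


for candidate in (r"^(foo|bar){2,3}\s+\w*$", r"\d+?(?=x)"):
    print(candidate, "->", non_portable_features(candidate) or "looks portable")
```

An empty result only means none of these four checks fired. The pattern still has to be tried in the tools you care about.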


[1] To some extent, basic Perl features work elsewhere and advanced features do not, depending on your idea of what is basic or advanced. I think of look-around features as advanced, and that tracks. I also think of \d for digits as basic, yet it is not supported in many regex flavors.