bijou64: A variable-length integer encoding

bijou64：一种变长整数编码

It’s nice when you work on security and accidentally get some performance for free. This is the story of a small encoding called bijou64 — a variable-length integer (varint) encoding that we developed for the Subduction CRDT sync protocol. 当你致力于安全工作却意外获得性能提升时，感觉总是很棒。这就是关于一种名为 bijou64 的小型编码的故事——这是我们为 Subduction CRDT 同步协议开发的一种变长整数（varint）编码。

It was intended to fix a subtle signature-verification bug by making each number only representable a single way. It turned out to also run a few times faster than the more common varint LEB128. We didn’t set out to write a fast varint, but it turns out that our design constraints made for an encoding that has to do less work. 它的初衷是通过确保每个数字只有唯一的表示方式，来修复一个隐蔽的签名验证漏洞。结果发现，它的运行速度也比常见的 LEB128 变长编码快了几倍。我们并非刻意追求编写一种快速的变长编码，但事实证明，我们的设计约束使得该编码所需处理的工作量更少。

The Problem

问题所在

Many binary protocols need a compact way to encode integers that are usually small but occasionally large. Variable-length integer encodings (“varints”) solve this, but most designs treat canonicality as an afterthought — something enforced by a runtime check in the decoder rather than by the structure of the encoding itself. 许多二进制协议需要一种紧凑的方式来编码通常较小、偶尔较大的整数。变长整数编码（“varints”）解决了这个问题，但大多数设计将“规范性”（canonicality）视为事后补救措施——即通过解码器中的运行时检查来强制执行，而不是通过编码本身的结构来保证。

Since it’s the most common varint, we’re going to pick on LEB128 a bit here. I want to emphasize how much LEB128 is a great choice for many projects, and the reasons that it was not a good choice for us also applies to the other formats that we looked at. It just happened to not be a perfect fit for our use case. 由于 LEB128 是最常见的变长编码，我们在这里要稍微“挑剔”一下它。我想强调的是，LEB128 对于许多项目来说仍然是一个极佳的选择，而它不适合我们的原因同样适用于我们研究过的其他格式。它只是恰好不完全契合我们的使用场景。

LEB128 encodes a number as a sequence of 7-bit segments, with the high bit of each byte signaling “more bytes follow” (there can be many such continuation segments, though we only show 2 segments below). This lets you avoid always writing 8 bytes (64 bits) even when you’re representing a small number (which would be mostly zeroes). This is like writing 5 instead of 000000005 just to get the correct number of characters. Putting aside that working in 7-bit is odd, this is a practical solution! LEB128 将数字编码为一系列 7 位片段，每个字节的高位用于指示“后续还有更多字节”（这种延续片段可以有很多，尽管我们在下方仅展示了 2 个）。这让你在表示小数字（通常大部分位为零）时，不必总是写入 8 个字节（64 位）。这就像写“5”而不是“000000005”来获得正确的字符数一样。撇开 7 位处理方式的怪异不谈，这确实是一个实用的解决方案！

But there’s a problem: the number 0 can be encoded as the single byte 0x00, but it can also be encoded as 0x80 0x00. Or 0x80 0x80 0x00. Or any longer sequence of 0x80s ending in a zero byte. 0x80 is 1 0000000, so you can have as many of those as you want and still get 0! Most LEB128 decoders will happily accept any of them. This isn’t unique to zero; nearly every number in LEB128 can be represented many ways. 但这里有个问题：数字 0 可以编码为单个字节 0x00，但也可以编码为 0x80 0x00，或者 0x80 0x80 0x00，甚至是任何以零字节结尾的 0x80 长序列。0x80 是 1 0000000，所以你可以随心所欲地添加这些字节，结果依然是 0！大多数 LEB128 解码器会欣然接受其中任何一种。这并非 0 所独有；在 LEB128 中，几乎每个数字都可以有多种表示方式。

This causes problems for signed data if you ever want to do things like compression since you need to know the exact bytes that were signed. Having an extra 0x80 will result in a different signature. If you only have one unique way to represent a number, you can do things like deduplicate runs of numbers when storing without needing to retain the entire original. 如果你需要对签名数据进行压缩等操作，这会引发问题，因为你需要知道被签名的确切字节。多出一个 0x80 会导致签名结果不同。如果你只有一种唯一的数字表示方式，你就可以在存储时对数字序列进行去重，而无需保留原始的完整数据。

Canonicalisation

规范化

One solution is to simply enforce a special “canonical” form. While encoding a varint, you must ensure that you use the canonical encoding (not all libraries will always do this). When decoding, you must validate that it matches your expected format. It would be really nice to not have to do that additional work. 一种解决方案是简单地强制执行一种特殊的“规范”形式。在编码变长整数时，你必须确保使用规范编码（并非所有库都会这样做）。在解码时，你必须验证它是否符合预期的格式。如果不需要做这些额外的工作，那将是非常好的。

So What?

那又怎样？

You could reasonably ask: who actually shows up with adversarial varints? The answer is “anyone who would benefit from your protocol mistaking two different byte strings for the same value.” For a signed protocol, that could be a lot of people. 你可能会合理地问：谁会真的使用对抗性的变长编码？答案是：“任何能从你的协议将两个不同的字节串误认为相同值中获益的人。”对于签名协议来说，这可能涉及很多人。

While not about varints per se, the textbook case is ASN.1 (the abstract syntax notation behind X.509 certificates, LDAP, and many other things that are widely depended on). Canonicalisation attacks have been used against PKCS#1 v1.5, Mozilla NSS, GnuTLS, JWTs and Bitcoin transactions. 虽然这不完全是关于变长编码，但教科书式的案例是 ASN.1（X.509 证书、LDAP 以及许多其他被广泛依赖的技术背后的抽象语法标记）。规范化攻击已被用于针对 PKCS#1 v1.5、Mozilla NSS、GnuTLS、JWT 和比特币交易。

The pattern in all of these is something like: The spec says: “the canonical encoding is X; any other encoding MUST be rejected.” The implementation has one or more ifs that enforces this. The check is separately deletable from the rest of the parser. Removing it doesn’t break round-trip tests; it doesn’t break tests using honestly-encoded data; it doesn’t break performance benchmarks. It only breaks under adversarial input, which is rarely in the test suite. The check is forgotten, optimised away, or never ported. The protocol’s security property silently degrades. 所有这些案例的模式大致如下：规范声明：“规范编码是 X；任何其他编码必须被拒绝。”实现代码中有一个或多个 if 语句来强制执行此规则。该检查可以从解析器的其余部分中独立删除。删除它不会破坏往返测试；不会破坏使用诚实编码数据的测试；也不会破坏性能基准测试。它只会在对抗性输入下失效，而这在测试套件中很少见。于是该检查被遗忘、被优化掉或从未被移植。协议的安全属性便悄然退化。

This is the bug class bijou64 is designed to make impossible. Not by adding more checks, but by removing the one that mattered — and making the format such that, with no canonicality check at all, the only encoding that exists for any given value is the canonical one. 这就是 bijou64 旨在消除的漏洞类别。它不是通过增加更多的检查，而是通过移除那个关键的检查——并使格式本身达到这样一种状态：即使完全没有规范性检查，对于任何给定的值，也只存在唯一的编码方式。

(Almost) Canonical by Construction

（几乎）构造即规范

bijou64 eliminates the possibility of more than one encoding per integer. Just like our normal written number system, there is exactly one way to write each number. Bijou uses two tricks: bijou64 消除了每个整数存在多种编码的可能性。就像我们正常的书写数字系统一样，每个数字只有一种书写方式。Bijou 使用了两个技巧：

First Byte Double Duty: The first byte represents 0–247 as normal. If you get 0x42, it just decodes to 0x42. 248–255 switch into a different mode: they’re a tag for how many bytes to expect after this first one, which will represent the number. This is really nice for decoding, because we know how much memory to allocate as soon as we read the first byte (O(1)). LEB128, by contrast, has to keep reading bytes until it sees one without the continuation bit set (O(n)).
首字节双重职责：第一个字节正常表示 0–247。如果你得到 0x42，它直接解码为 0x42。248–255 切换到另一种模式：它们是一个标签，指示在此首字节之后应预期多少个字节来表示该数字。这对解码非常友好，因为我们在读取第一个字节后立即就知道需要分配多少内存（O(1)）。相比之下，LEB128 必须不断读取字节，直到看到一个没有设置延续位的字节为止（O(n)）。
Offsets: The tag alone is not enough to guarantee canonicity, but it points the way: instead of repeating 0-247 in the second byte (0xF8 0x00 == 0x00), we offset the next byte by 248 (0xF8). This means that 0xF8 0x00 == 0xF8 == 248, not 0 (since 0 is already represented as 0x00).
偏移量：仅靠标签不足以保证规范性，但它指明了方向：我们不在第二个字节中重复 0-247（即 0xF8 0x00 == 0x00），而是将下一个字节偏移 248 (0xF8)。这意味着 0xF8 0x00 == 0xF8 == 248，而不是 0（因为 0 已经由 0x00 表示）。

(Note: The article continues with technical examples and tables, but the core explanation of the bijou64 design is provided above.) （注：原文后续包含技术示例和表格，但 bijou64 设计的核心解释已如上所述。）