How a 2021 Quantization Algorithm Quietly Outperforms Its 2026 Successor

One scale parameter determines accuracy in rotation-based vector quantization.

TurboQuant [3], an online vector quantization method, drew wide public attention at ICLR 2026. To me, it looked very familiar: it overlaps heavily with EDEN, a quantization method I co-authored with Ran Ben-Basat, Yaniv Ben-Itzhak, Gal Mendelson, Michael Mitzenmacher, and Shay Vargaftik, first introduced as the 1-bit method DRIVE at NeurIPS 2021 [1] and generalized to arbitrary bit-widths at ICML 2022 [2].

The TurboQuant paper presents two variants: TurboQuant-mse and TurboQuant-prod. In a detailed new comparison [5] we show that TurboQuant-mse is a degenerate case of EDEN, and that the EDEN variants consistently outperform their TurboQuant counterparts.

How EDEN quantizes a vector

Suppose you need to compress a $d$-dimensional vector $x$ (a gradient update, an embedding, a KV-cache entry) down to a few bits per coordinate. EDEN proceeds in four steps (a code sketch follows the list):

  1. Random rotation — Multiply by a random orthogonal matrix $\Pi$. After rotation the coordinates are identically distributed and, for large $d$, approximately Gaussian.

  2. Scalar quantization — Round each rotated coordinate to one of $2^b$ levels from a Lloyd–Max codebook trained on the known rotated coordinate distribution ($b$ is the target number of bits per coordinate).

  3. Scale — Multiply by a scale factor $S$.

  4. Inverse rotation — Apply $\Pi^\top$ to recover an approximation $\hat{x}$ of the original vector.
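
The pipeline is compact enough to sketch directly. Below is a minimal NumPy illustration of the 1-bit case (where the Lloyd–Max codebook degenerates to $\{-1, +1\}$, i.e., DRIVE). This is a sketch under assumptions, not the papers' implementation: the function name is mine, the rotation is a dense QR-based orthogonal matrix where the papers use fast structured rotations (e.g., randomized Hadamard transforms), and the per-vector least-squares and ratio scales stand in for the closed-form choices discussed below; see [1, 2] for the exact constructions.

```python
import numpy as np

def eden_1bit(x, rng, unbiased=False):
    """Sketch of EDEN's four steps at b = 1 bit (the DRIVE case,
    where the codebook is just {-1, +1})."""
    d = x.shape[0]
    # Step 1: random rotation. A dense QR-based orthogonal matrix is
    # used here for clarity; real implementations use fast structured
    # rotations instead.
    P, _ = np.linalg.qr(rng.standard_normal((d, d)))
    z = P @ x                  # coordinates ~ N(0, ||x||^2 / d) for large d
    # Step 2: scalar quantization (b = 1: keep only the signs).
    q = np.sign(z)
    # Step 3: scale. Two choices, matching the two EDEN variants below.
    if unbiased:
        S = np.dot(x, x) / np.dot(z, q)   # ratio scale: E[x_hat] = x
    else:
        S = np.dot(z, q) / np.dot(q, q)   # least-squares fit to z
    # Step 4: inverse rotation.
    return S * (P.T @ q)
```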

While earlier works (e.g., Suresh et al. (2017) [6]) used rotation mainly to shrink the coordinates’ dynamic range (the gap between the largest and smallest coordinate value), EDEN [1] was, to the best of our knowledge, the first quantization scheme to exploit a stronger fact about random rotation: the post-rotation coordinates follow a known distribution. This lets us use a deterministic quantizer paired with a closed-form scale that, depending on the application, either minimizes MSE or makes the estimate unbiased. Both scales are derived analytically, and the construction yields an asymptotic MSE reduction over the previous approach.
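
Because the post-rotation coordinate distribution is known in advance (approximately Gaussian after normalizing by $\|x\|/\sqrt{d}$), the step-2 codebook can be trained once, offline, against a standard normal. The snippet below, an illustrative stand-in for however the papers compute their codebooks, does this with Lloyd's algorithm on a large Gaussian sample.

```python
def lloyd_max_gaussian(b, iters=100, n=500_000, seed=0):
    """Train a 2**b-level Lloyd-Max codebook for N(0, 1) by running
    Lloyd's algorithm on a large Gaussian sample (computed once, offline)."""
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal(n)
    # Initialize levels at the sample quantiles, then alternate between
    # recomputing decision boundaries and cell centroids.
    levels = np.quantile(samples, (np.arange(2**b) + 0.5) / 2**b)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2     # boundaries between levels
        idx = np.searchsorted(edges, samples)      # nearest-level assignment
        levels = np.array([samples[idx == k].mean() for k in range(2**b)])
    return levels
```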

Concretely, EDEN’s two variants differ only in the choice of $S$:

  • EDEN-biased — sets $S$ to the closed-form value that minimizes the reconstruction MSE.

  • EDEN-unbiased — chooses $S$ so the decompressed output is correct on average ($\mathbb{E}[\hat{x}] = x$), which matters particularly whenever you average many quantized vectors (e.g., distributed training, attention); a short demo follows this list.
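
To see what the unbiased choice buys, here is a quick check built on the eden_1bit sketch above: averaging many independent 1-bit quantizations of the same vector should recover it, which is exactly the property that distributed averaging relies on.

```python
rng = np.random.default_rng(42)
x = rng.standard_normal(256)
# Each call draws a fresh random rotation, so the estimates are independent.
reps = [eden_1bit(x, rng, unbiased=True) for _ in range(1024)]
avg = np.mean(reps, axis=0)
print(np.linalg.norm(avg - x) / np.linalg.norm(x))  # small, shrinking ~ 1/sqrt(1024)
```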

Lined up against EDEN, TurboQuant-mse matches at every step except one: where EDEN derives the scale $S$ analytically, TurboQuant-mse, although it targets MSE minimization, skips the optimized scaling and effectively uses $S=1$.

Why the optimal scale is worth it

The value of applying the proper scale $S$ grows with bit-width. At $b=1$ bit, the gap is marginal. At $d=128$ and $b=4$ bits, EDEN-biased reduces MSE by 2.25% over TurboQuant-mse, and these are the bit-widths practitioners actually use for embeddings and KV caches. Across dimensions 16 to 4096 and all tested bit-widths $b \in \{1,2,3,4\}$, EDEN-biased’s vNMSE (vector-normalized MSE, $\mathbb{E}[\|x - \hat{x}\|^2] / \|x\|^2$) falls below TurboQuant-mse’s in every case. As dimension grows very large, the optimal $S$ approaches 1 and the two algorithms converge, but at practical dimensions (128–1024), the gap persists.
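
The effect is easy to probe with the sketches above. The Monte-Carlo harness below (mine, not either paper's code) measures vNMSE for the general-$b$ pipeline with the least-squares scale versus a scale that merely undoes the normalization, which mimics TurboQuant-mse's $S=1$ in normalized units. It reuses lloyd_max_gaussian() from the earlier snippet.

```python
def vnmse(b, d, optimal_scale=True, trials=200, seed=1):
    """Monte-Carlo vNMSE of rotate -> Lloyd-Max quantize -> scale -> rotate back."""
    rng = np.random.default_rng(seed)
    levels = lloyd_max_gaussian(b)
    edges = (levels[:-1] + levels[1:]) / 2
    errs = []
    for _ in range(trials):
        x = rng.standard_normal(d)
        P, _ = np.linalg.qr(rng.standard_normal((d, d)))
        z = P @ x
        sigma = np.linalg.norm(x) / np.sqrt(d)   # per-coordinate std after rotation
        q = levels[np.searchsorted(edges, z / sigma)]
        if optimal_scale:
            S = np.dot(z, q) / np.dot(q, q)      # least-squares scale (EDEN-biased)
        else:
            S = sigma                            # just undo normalization (S = 1)
        x_hat = S * (P.T @ q)
        errs.append(np.sum((x - x_hat) ** 2) / np.sum(x ** 2))
    return float(np.mean(errs))

# Compare, e.g., vnmse(4, 128, optimal_scale=True) vs. vnmse(4, 128, optimal_scale=False).
```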

Unbiased compression: saving more than a full bit

The results above concern the biased (MSE-minimizing) variants. Now consider the unbiased case, where applications such as distributed training, approximate attention, or inner-product retrieval need $\mathbb{E}[\hat{x}] = x$ because they average many quantized vectors. EDEN-unbiased uses the same single-pass algorithm as EDEN-biased, just with $S$ chosen for bias correction. TurboQuant’s unbiased variant, TurboQuant-prod, takes a different route: it spends $(b-1)$ bits on the biased TurboQuant-mse step and reserves 1 bit for a QJL (Quantized Johnson–Lindenstrauss) [4] correction on the residual.

EDEN-unbiased outperforms TurboQuant-prod in every tested configuration, and by a substantial margin. The gap traces to three structural advantages of EDEN’s single-pass design:

  1. EDEN optimizes the scale. TurboQuant-prod inherits TurboQuant-mse’s $S=1$ first stage, so it carries the same MSE penalty.

  2. EDEN’s 1-bit construction has lower variance than QJL. In large dimensions, EDEN’s 1-bit vNMSE converges to $\pi/2 - 1 \approx 0.57$ [1], while QJL’s converges to $\pi/2 \approx 1.57$ [4], roughly 2.75× higher (a short derivation of the former constant follows this list).

  3. EDEN spends the full bit budget on a single unbiased quantizer. TurboQuant-prod splits the budget into $(b-1)$ biased bits plus 1 residual bit, which empirically underperforms spending all $b$ bits on a single unbiased quantizer [5].
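
For intuition on the $\pi/2 - 1$ constant in point 2, here is the calculation in the large-$d$ limit, where a normalized rotated coordinate is exactly a standard Gaussian $g \sim \mathcal{N}(0,1)$. The unbiased 1-bit estimate of $g$ is $s\,\mathrm{sign}(g)$ with $s$ chosen so the estimate preserves inner products in expectation, $\mathbb{E}[s\,\mathrm{sign}(g)\,g] = \mathbb{E}[g^2] = 1$; since $\mathbb{E}|g| = \sqrt{2/\pi}$, this forces $s = \sqrt{\pi/2}$, and the per-coordinate error is

$$\mathbb{E}\big[(s\,\mathrm{sign}(g) - g)^2\big] = s^2 - 2s\,\mathbb{E}|g| + 1 = \frac{\pi}{2} - 2 + 1 = \frac{\pi}{2} - 1 \approx 0.57.$$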

These effects compound. The result: 1-bit, 2-bit, and 3-bit EDEN-unbiased are each more accurate than 2-bit, 3-bit, and 4-bit TurboQuant-prod, respectively. By swapping in EDEN you can drop a bit per coordinate and still match TurboQuant-prod’s accuracy.