Qdrant TurboQuant Explained: Is TurboQuant the Silver Bullet?

Most engineers see quantization as shrinking vectors. TurboQuant asks a harder question: can you shrink them without breaking their geometry?

Most engineers view quantization as a tradeoff between memory and recall. The baseline is Float32, which offers high fidelity at a high memory cost. The basic solution is scalar quantization, which stores each value in fewer bits (around 4× compression) with a slight recall loss. Binary quantization pushes much harder, often reaching 32× compression, but retrieval results can become inconsistent because of the information lost. Product quantization may be more efficient, but it is harder to tune and operate in production.

In early May 2026, Qdrant released TurboQuant, a new quantization method, claiming that “TurboQuant can reduce memory use without making retrieval quality too unstable”. TurboQuant sounds like the kind of feature vector search teams want. However, I wondered whether it still holds up when tested across different dataset sizes. Does it give a real improvement over common quantization methods, or does its advantage depend on the data?

I ran experiments to compare it with more familiar quantization methods such as scalar and binary quantization. The goal was to understand where TurboQuant is useful, where it is risky, and whether it can be treated as a serious default option for vector search. This should help engineers, ML practitioners, and vector database users understand where TurboQuant fits relative to more common quantization methods, especially when moving from experiments to production.

1. What is Quantization?

Every float32 number in a vector uses 4 bytes. As a result, a 1536-dimension embedding takes about 6 KB per vector; at a million vectors, that is roughly 6 GB just for the vectors. This is where quantization comes in. Quantization stores each number in a vector using fewer bits. The standard approach is scalar quantization. It starts by finding the min and max of each dimension. That range is then divided into 255 equal steps, which gives 256 levels. Every value in the vector is rounded to the nearest level, and the level index is stored as a single byte instead of four. The original Float32 embedding becomes a uint8 embedding at 4× compression, meaning a quarter of the storage size.
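The arithmetic and the rounding step above can be sketched in a few lines of plain Python. This is a minimal illustration of min/max scalar quantization as described here, not Qdrant's implementation; the function names (`raw_size_bytes`, `fit_ranges`, `quantize`, `dequantize`) and the tiny 8-dimensional random dataset are assumptions made for the example.

```python
import random

BYTES_PER_FLOAT32 = 4

def raw_size_bytes(num_vectors, dim=1536):
    """Storage for uncompressed float32 vectors, ignoring any index overhead."""
    return num_vectors * dim * BYTES_PER_FLOAT32

def fit_ranges(vectors):
    """Per-dimension (min, max) computed over the whole dataset."""
    return [(min(column), max(column)) for column in zip(*vectors)]

def quantize(vector, ranges, levels=256):
    """Round each value to the nearest of `levels` evenly spaced levels (one byte each)."""
    codes = []
    for value, (low, high) in zip(vector, ranges):
        # Fall back to a step of 1.0 when a dimension is constant (high == low).
        step = (high - low) / (levels - 1) or 1.0
        code = round((value - low) / step)
        # Clamp values that fall outside the fitted range.
        codes.append(min(levels - 1, max(0, code)))
    return bytes(codes)

def dequantize(codes, ranges, levels=256):
    """Map each stored level index back to an approximate float value."""
    return [low + code * (high - low) / (levels - 1)
            for code, (low, high) in zip(codes, ranges)]

rng = random.Random(0)
dataset = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(100)]
ranges = fit_ranges(dataset)
codes = quantize(dataset[0], ranges)
restored = dequantize(codes, ranges)

print(raw_size_bytes(1))          # 6144 bytes, about 6 KB per vector
print(raw_size_bytes(1_000_000))  # 6144000000 bytes, about 6 GB
print(len(codes))                 # 8 bytes stored instead of 8 * 4 = 32
```

Each restored value differs from the original by at most half a step, which is the per-value rounding error that the next paragraph discusses.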

The tiny difference between each original value and its rounded value is called quantization error. It accumulates across all dimensions of the vector during dot product computation (6 dimensions in the illustrated example), and this is what makes similarity scores slightly wrong. There are also more aggressive compression levels, such as 8× (4-bit), 16× (2-bit), or 32× (1-bit). The higher the compression, the smaller the vector, and the larger the error relative to the original.
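To make the trend concrete, here is a small sketch that measures how far a quantized dot product drifts from the exact one at 8, 4, 2, and 1 bits. It assumes values drawn uniformly from [-1, 1] and a plain uniform grid over that range, which is a simplification chosen for the example and not how any particular database quantizes; `quantize_uniform` and `mean_dot_error` are hypothetical helper names.

```python
import random

def quantize_uniform(vector, bits, low=-1.0, high=1.0):
    """Snap each value to the nearest of 2**bits evenly spaced levels in [low, high]."""
    step = (high - low) / (2 ** bits - 1)
    return [low + round((min(high, max(low, v)) - low) / step) * step for v in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_dot_error(bits, dim=256, trials=200, seed=0):
    """Average absolute gap between the exact and the quantized dot product."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        b = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        exact = dot(a, b)
        approx = dot(quantize_uniform(a, bits), quantize_uniform(b, bits))
        total += abs(exact - approx)
    return total / trials

errors = {bits: mean_dot_error(bits) for bits in (8, 4, 2, 1)}
for bits, error in errors.items():
    # Compression ratio relative to 32-bit floats.
    print(f"{bits}-bit ({32 // bits}x compression): mean dot product error {error:.4f}")
```

On this synthetic data the error grows at every step down in bit width, which matches the tradeoff described above. Real embeddings are not uniformly distributed, so the absolute numbers should not be read as a benchmark.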

2. The Real Question is Not Compression Ratio

The real question is: what vector geometry remains after compression? Traditional quantizers, in most cases, compress the vector directly. Scalar quantization applies the same fixed grid to every dimension, whether that dimension contains useful signal or noise. Binary quantization keeps only the sign bit of each value. Neither method first checks whether some dimensions carry more signal than others.
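A toy example shows what keeping only the sign bit can cost. The two hand-picked 4-dimensional vectors below are assumptions for illustration: they put their energy in different coordinates, so they are nearly orthogonal, yet they share the same sign pattern. The helpers `binarize`, `hamming`, and `cosine` are illustrative names, not part of any library.

```python
import math

def binarize(vector):
    """Keep one bit per dimension: 1 for non-negative values, 0 for negative ones."""
    return [1 if value >= 0 else 0 for value in vector]

def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Same sign pattern, but the large magnitudes sit in different dimensions.
first = [0.9, -0.8, 0.01, -0.02]
second = [0.01, -0.02, 0.9, -0.8]

print(round(cosine(first, second), 3))              # close to 0: nearly orthogonal
print(hamming(binarize(first), binarize(second)))   # 0: identical after binarization
```

After binarization the two vectors are indistinguishable, because the magnitudes that told them apart were discarded. This is the kind of geometry loss the question above is about.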

Qdrant 1.18 changes this pattern by integrating TurboQuant. Based on a Google Research algorithm presented at ICLR 2026, TurboQuant rotates the vector before compression. This random rotation spreads variance more evenly across dimensions, so each bit can preserve more useful information. TurboQuant is not better because it uses fewer bits. It is better because it makes the vector easier to compress before spending those bits.

3. TurboQuant in Short: Rotate First, Compress Second

Every vector produced by an embedding model has structure. A 1536-dimensional embedding might carry most of its useful signal in only a small subset of coordinates. The remaining dimensions often contribute much less, but they still appear in every vector, which adds noise and makes distance comparisons less reliable.

3.1 The TurboQuant Pipeline

The idea is simple. Before compressing, spin the vector through a random orthogonal rotation. That rotation does not change distances; it just redistributes energy so that every dimension carries roughly the same amount of information. Then a single precomputed codebook is applied.
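The rotate-then-compress idea can be sketched with the standard library alone. This is not TurboQuant's actual code: the rotation is built with Gram-Schmidt on Gaussian vectors, the 4-entry `CODEBOOK` holds illustrative values, and the scaling by `norm / sqrt(dim)` is an assumption made so that one shared codebook can serve every coordinate. All function names are hypothetical.

```python
import math
import random

def random_orthogonal(dim, seed=0):
    """Random rotation matrix: orthonormal rows via Gram-Schmidt on Gaussian vectors."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for row in rows:
            projection = sum(x * y for x, y in zip(v, row))
            v = [x - projection * y for x, y in zip(v, row)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-9:
            rows.append([x / norm for x in v])
    return rows

def rotate(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def max_energy_share(vector):
    """Fraction of the squared norm held by the single largest coordinate."""
    energy = [x * x for x in vector]
    return max(energy) / sum(energy)

# Hypothetical shared 2-bit codebook (illustrative values only).
CODEBOOK = [-1.5, -0.5, 0.5, 1.5]

def encode(rotated):
    """Assign every coordinate to its nearest codebook entry after rescaling."""
    scale = math.sqrt(sum(x * x for x in rotated)) / math.sqrt(len(rotated))
    return [min(range(len(CODEBOOK)), key=lambda i: abs(x / scale - CODEBOOK[i]))
            for x in rotated]

DIM = 64
a = [5.0, -4.0, 3.0] + [0.05] * (DIM - 3)    # energy concentrated in 3 coordinates
b = [4.0, -5.0, 2.0] + [-0.05] * (DIM - 3)

rotation = random_orthogonal(DIM)
rotated_a, rotated_b = rotate(rotation, a), rotate(rotation, b)
codes = encode(rotated_a)

print(abs(distance(a, b) - distance(rotated_a, rotated_b)) < 1e-9)  # distances preserved
print(max_energy_share(a), max_energy_share(rotated_a))             # energy spread out
```

Before rotation, one coordinate of `a` holds about half of the vector's energy; after rotation, no single coordinate dominates, which is why one codebook can reasonably be reused for every dimension. The distance between `a` and `b` is unchanged up to floating-point error.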