RyanCodrai / turbovec

A 10 million document corpus takes 31 GB of RAM as float32. turbovec fits it in 4 GB, and searches it faster than FAISS.
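
The headline figures can be sanity-checked with simple arithmetic. The sketch below assumes 768-dimensional embeddings and 4-bit codes; the text above does not state the dimension or bit width behind the 31 GB and 4 GB numbers, so both are assumptions, and per-vector metadata and file headers are ignored.

```python
def index_bytes(n_vectors: int, dim: int, bits_per_dim: float) -> float:
    """Raw payload size in bytes, ignoring per-vector metadata and headers."""
    return n_vectors * dim * bits_per_dim / 8


N = 10_000_000  # corpus size from the text
DIM = 768       # ASSUMED embedding dimension (not stated in the source)

float32_gb = index_bytes(N, DIM, 32) / 1e9
four_bit_gb = index_bytes(N, DIM, 4) / 1e9

print(f"float32: {float32_gb:.1f} GB, 4-bit: {four_bit_gb:.1f} GB")
# float32: 30.7 GB, 4-bit: 3.8 GB
```

Under those assumptions the ratio is simply 32 / 4 = 8x, independent of corpus size.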

turbovec is a Rust vector index with Python bindings, built on Google Research’s TurboQuant algorithm: a data-oblivious quantizer that matches the Shannon lower bound on distortion, with no codebook training and no separate train phase.
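
To make "data-oblivious" concrete, here is a minimal NumPy sketch of a quantizer whose rotation and code grid are fixed from a seed before any data is seen, so there is nothing to train. It is not the TurboQuant algorithm and not turbovec's code: the uniform grid, the 3-sigma clipping range, and every function name here are illustrative assumptions.

```python
import numpy as np


def make_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Random orthogonal matrix drawn once from a seed, independent of any data."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix; columns stay orthonormal


def _grid(dim: int, bit_width: int):
    """Fixed uniform grid over +/- 3 standard deviations of a rotated unit-vector coordinate."""
    levels = 2 ** bit_width
    limit = 3.0 / np.sqrt(dim)
    return levels, limit, 2 * limit / levels


def quantize(x: np.ndarray, rotation: np.ndarray, bit_width: int) -> np.ndarray:
    levels, limit, step = _grid(x.shape[-1], bit_width)
    unit = x / np.linalg.norm(x, axis=-1, keepdims=True)  # keep only the direction
    rotated = unit @ rotation.T
    codes = np.floor((rotated + limit) / step)
    return np.clip(codes, 0, levels - 1).astype(np.uint8)


def dequantize(codes: np.ndarray, rotation: np.ndarray, bit_width: int) -> np.ndarray:
    _, limit, step = _grid(codes.shape[-1], bit_width)
    rotated = (codes.astype(np.float64) + 0.5) * step - limit  # bin centres
    return rotated @ rotation  # undo the rotation


rng = np.random.default_rng(1)
x = rng.standard_normal((8, 256))
rot = make_rotation(256)
codes = quantize(x, rot, bit_width=4)
approx = dequantize(codes, rot, bit_width=4)
cosine = np.sum(approx * x, axis=1) / (
    np.linalg.norm(approx, axis=1) * np.linalg.norm(x, axis=1)
)
print(cosine.min())
```

Because nothing in `make_rotation` or `_grid` depends on the vectors, new vectors can be encoded at any time without revisiting old ones, which is the property the online-ingest claim relies on.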

  • Online ingest. Add vectors and they’re indexed: no train step, no parameter tuning, no rebuilds as the corpus grows.
  • Faster than FAISS. Hand-written NEON (ARM) and AVX-512BW (x86) kernels beat FAISS IndexPQFastScan by 12–20% on ARM and run within a few percent of it on x86, ahead on every 4-bit config.
  • Filter at search time. Pass an id allowlist (or a slot bitmask) to search() and the kernel honours it directly. You always get up to k results from the allowed set: no over-fetching, no recall hit on selective filters.
  • Pure local. No managed service, no data leaving your machine or VPC. Pair with any open-source embedding model for a fully air-gapped RAG stack.

Building RAG where privacy, memory, or latency matters? You’re in the right place.

Python

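pip install turbovec

from turbovec import TurboQuantIndex

# 1536-dimensional vectors stored as 4-bit codes.
index = TurboQuantIndex(dim=1536, bit_width=4)
index.add(vectors)        # indexed on insert, no train step
index.add(more_vectors)   # later batches go into the same index
scores, indices = index.search(query, k=10)

# Persist to disk and load it back.
index.write("my_index.tq")
loaded = TurboQuantIndex.load("my_index.tq")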

Need stable ids that survive deletes? Use IdMapIndex:

import numpy as np
from turbovec import IdMapIndex

index = IdMapIndex(dim=1536, bit_width=4)
index.add_with_ids(vectors, np.array([1001, 1002, 1003], dtype=np.uint64))
scores, ids = index.search(query, k=10) # ids are your uint64 external ids
index.remove(1002) # O(1) by id
index.write("my_index.tvim")
loaded = IdMapIndex.load("my_index.tvim")
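
One common way to get O(1) removal by external id is a hash map plus swap-with-last on a dense slot array. The sketch below illustrates that general technique only; the source does not say how IdMapIndex implements removal, and `SlotMap` is a hypothetical name.

```python
class SlotMap:
    """Maps external ids to dense slots; removal is O(1) via swap-with-last."""

    def __init__(self) -> None:
        self.slot_of: dict[int, int] = {}  # external id -> slot
        self.id_at: list[int] = []         # slot -> external id

    def add(self, ext_id: int) -> int:
        if ext_id in self.slot_of:
            raise KeyError(f"duplicate id {ext_id}")
        slot = len(self.id_at)
        self.slot_of[ext_id] = slot
        self.id_at.append(ext_id)
        return slot

    def remove(self, ext_id: int) -> None:
        slot = self.slot_of.pop(ext_id)  # raises KeyError if the id is unknown
        last_id = self.id_at.pop()
        if last_id != ext_id:
            # Move the last entry into the freed slot so slots stay dense.
            # A real index would move that vector's stored code as well.
            self.id_at[slot] = last_id
            self.slot_of[last_id] = slot


m = SlotMap()
for ext_id in (1001, 1002, 1003):
    m.add(ext_id)
m.remove(1002)
print(m.id_at)  # [1001, 1003]
```

The surviving ids are unchanged from the caller's point of view; only the internal slot of the moved entry differs.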

Hybrid retrieval (filtered search)

Restrict results to a candidate set produced by another system (SQL, BM25, ACL, time window, …):

import numpy as np
from turbovec import IdMapIndex

idx = IdMapIndex(dim=1536, bit_width=4)
idx.add_with_ids(vectors, ids)

# Stage 1: external system narrows to candidate ids.
rows = db.execute("SELECT id FROM docs WHERE tenant=?", (t,)).fetchall()
allowed = np.array([row[0] for row in rows], dtype=np.uint64)  # fetchall() yields 1-tuples; flatten to a 1-D id array

# Stage 2: dense rerank within the candidate set.
scores, ids = idx.search(query, k=10, allowlist=allowed)

Filtering happens inside the SIMD kernel at 32-vector block granularity. Blocks with no allowed slots are short-circuited before any LUT lookup or scoring work, and individual non-allowed slots inside scored blocks are dropped at heap-insert. Selective allowlists (where only a small fraction of the index is allowed) therefore avoid most of the SIMD cost rather than paying it and discarding the result afterwards. The output length is min(k, len(allowed)): when the allowlist is smaller than k you get exactly len(allowed) results rather than padded fallbacks. See docs/api.md for the full reference.
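
The control flow described above can be sketched in plain NumPy. This is a scalar illustration of block-level short-circuiting and heap-insert filtering, not the SIMD kernel: it scores float vectors by inner product instead of quantized codes via LUTs, and `filtered_search` with its signature is a hypothetical name.

```python
import heapq

import numpy as np

BLOCK = 32  # block granularity stated in the text


def filtered_search(db, query, k, allowed_mask):
    """Top-k inner-product search restricted to slots where allowed_mask is True."""
    heap = []  # min-heap of (score, slot), never larger than k
    blocks_scored = 0
    for start in range(0, len(db), BLOCK):
        mask = allowed_mask[start:start + BLOCK]
        if not mask.any():
            continue  # whole block skipped before any scoring work
        blocks_scored += 1
        scores = db[start:start + BLOCK] @ query
        for offset in np.flatnonzero(mask):  # disallowed slots never reach the heap
            item = (float(scores[offset]), start + int(offset))
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    top = sorted(heap, reverse=True)
    return [s for s, _ in top], [i for _, i in top], blocks_scored


rng = np.random.default_rng(0)
db = rng.standard_normal((1024, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)
mask = np.zeros(1024, dtype=bool)
mask[[3, 40, 41, 900]] = True  # 4 allowed slots spread over 3 blocks

scores, slots, blocks_scored = filtered_search(db, query, k=10, allowed_mask=mask)
print(len(slots), blocks_scored)  # 4 3
```

With 4 allowed slots and k=10 the result has min(k, len(allowed)) = 4 entries, and only 3 of the 32 blocks are scored.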

Framework integrations

Drop-in replacements for the in-tree reference vector / document stores in each framework. Same public surface, same persistence semantics, same retriever and pipeline wiring: swap the import and keep your pipeline.

  • LangChain: pip install turbovec[langchain] · replaces langchain_core.vectorstores.InMemoryVectorStore
  • LlamaIndex: pip install turbovec[llama-index] · replaces llama_index.core.vector_stores.SimpleVectorStore
  • Haystack: pip install turbovec[haystack] · replaces haystack.document_stores.in_memory.InMemoryDocumentStore
  • Agno: pip install turbovec[agno] · replaces agno.vectordb.lancedb.LanceDb

Rust

use turbovec::TurboQuantIndex;

let mut index = TurboQuantIndex::new(1536, 4);
index.add(&vectors);
let results = index.search(&queries, 10);
index.write("index.tv").unwrap();
let loaded = TurboQuantIndex::load("index.tv").unwrap();

For stable external ids that survive deletes:

use turbovec::IdMapIndex;

let mut index = IdMapIndex::new(1536, 4);
index.add_with_ids(&vectors, &[1001, 1002, 1003]);
let (scores, ids) = index.search(&queries, 10);
index.remove(1002);
index.write("index.tvim").unwrap();
let loaded = IdMapIndex::load("index.tvim").unwrap();

Recall

TurboQuant vs FAISS IndexPQ (LUT256, nbits=8), the paper’s Section 4.4 baseline. 100K vectors, k=64. FAISS PQ sub-quantizer counts are sized to match TurboQuant’s bit rate (m=d/4 at 2-bit, m=d/2 at 4-bit).
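
The bit-rate matching follows from how product quantization stores codes: m sub-quantizers at nbits bits each give m × nbits bits per vector. A short check of the two settings above (the helper name is illustrative):

```python
def pq_bits_per_dim(dim: int, m: int, nbits: int = 8) -> float:
    """Average bits per dimension for a PQ code with m sub-quantizers of nbits each."""
    return m * nbits / dim


for dim in (200, 1536, 3072):
    two_bit = pq_bits_per_dim(dim, m=dim // 4)   # matches 2-bit TurboQuant
    four_bit = pq_bits_per_dim(dim, m=dim // 2)  # matches 4-bit TurboQuant
    print(dim, two_bit, four_bit)
```

Each line prints 2.0 and 4.0 bits per dimension, so both methods are compared at equal code size.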

Across OpenAI d=1536 and d=3072, TurboQuant beats FAISS by 0.4–3.4 points at R@1 across 2-bit and 4-bit, and both converge to 1.0 by k=4. GloVe d=200 is the harder regime: at low dimension the asymptotic Beta assumption is looser. There, TurboQuant beats FAISS by 0.3 points at 4-bit and trails by 1.2 points at 2-bit at R@1, with the gap to FAISS closing by k≈16.
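
For readers reproducing these numbers, here is a minimal sketch of the metric, assuming it is the fraction of queries whose true nearest neighbour appears among the top k returned ids. That reading fits "converge to 1.0 by k=4", but the exact definition is not spelled out here, so treat it as an assumption; the function name is illustrative.

```python
import numpy as np


def recall_1_at_k(true_nn: np.ndarray, retrieved: np.ndarray, k: int) -> float:
    """true_nn: (n_queries,) exact nearest-neighbour ids.
    retrieved: (n_queries, >=k) ranked ids returned by the index under test."""
    hits = (retrieved[:, :k] == true_nn[:, None]).any(axis=1)
    return float(hits.mean())


true_nn = np.array([7, 2, 5])
retrieved = np.array([
    [7, 1, 3, 4],  # true neighbour at rank 1
    [9, 2, 0, 1],  # true neighbour at rank 2
    [4, 3, 1, 5],  # true neighbour at rank 4
])
for k in (1, 2, 4):
    print(k, recall_1_at_k(true_nn, retrieved, k))
```

In this toy input the recall rises from 1/3 at k=1 to 1.0 at k=4, the same shape of curve the paragraph describes.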

A note on baselines. We compare against FAISS IndexPQ (LUT256, nbits=8, float32 LUT) because it’s the default production-grade PQ most users would reach for. This is a stronger baseline than the custom u8-LUT PQ in the TurboQuant paper: FAISS uses a higher-precision LUT at scoring time and k-means++ for codebook training. We reproduce the paper’s TurboQuant numbers on OpenAI d=1536 / d=3072 and hit similar numbers to other community reference implementations on low-dim embeddings (see turboquant-py at d=384). The visible gap on GloVe reflects FAISS being a strong baseline, not a TurboQuant implementation issue.

Search Speed

All benchmarks: 100K vectors, 1K queries, k=64, median of 5 runs.
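
A median-of-5 measurement can be sketched as below. The project's actual benchmark harness is not shown here, so this only illustrates the protocol named above; warm-up, thread pinning, and batching are left out, and `median_seconds` is a hypothetical helper.

```python
import statistics
import time


def median_seconds(fn, runs: int = 5) -> float:
    """Call fn() `runs` times and return the median wall-clock duration in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; in a real benchmark this would be a batch search call.
elapsed = median_seconds(lambda: sum(range(100_000)))
print(f"{elapsed * 1e3:.3f} ms")
```

The median is less sensitive than the mean to a single slow run caused by scheduling noise.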

ARM (Apple M3 Max)

On ARM, TurboQuant beats FAISS FastScan by 12–20% across every config.

x86 (Intel Xeon Platinum 8481C / Sapphire Rapids, 8 vCPUs)

On x86, TurboQuant wins every 4-bit config by 1–6% and runs within ~1% of FAISS on 2-bit single-threaded (ST). The 2-bit multi-threaded (MT) rows (d=1536 and d=3072) are the only configs sitting slightly behind FAISS (2–4%), because the inner accumulate loop is too short for loop unrolling to amortize well enough to match FAISS’s AVX-512 VBMI path.

How it works

Each vector is a direction on a high-dimensional hypersphere. TurboQuant compresses these directions using a simple…