Building a RAG System from Scratch — Design Decisions Explained
In the previous article, we built a working RAG pipeline. Now let’s step back and ask why we made each design decision, and what alternatives exist when your requirements change.
The Full Picture
Here’s what we built:
- Ingest phase: Text → gemini-embedding-001 (RETRIEVAL_DOCUMENT, 768 dims) → pgvector (HNSW index, cosine similarity)
- Query phase: Question → gemini-embedding-001 (RETRIEVAL_QUERY, 768 dims) → pgvector search (top-k) → Gemini 2.5 Flash (answer generation)
Every element in this diagram was a choice. Let’s examine each one.
Decision 1: pgvector over a Dedicated Vector DB
We used pgvector, a PostgreSQL extension, rather than a purpose-built vector database like Pinecone, Weaviate, or Qdrant.
Why pgvector works here:
- Integrates with existing PostgreSQL infrastructure — no new service to operate.
- SQL and vector search in the same query: filter by category, join with other tables, all in one round-trip.
- Handles millions of documents comfortably with HNSW indexing.
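As a minimal sketch of the second point, the helper below builds one parameterized query that joins, filters, and ranks by cosine distance in a single round-trip. The table and column names (`documents`, `categories`, `embedding`, `category_id`) are hypothetical and not taken from the previous article; `<=>` is pgvector's cosine distance operator, and the `%s` placeholders assume a psycopg-style driver.

```python
def build_filtered_search(category, query_vec, k=5):
    """Return (sql, params) for a category-filtered cosine search.

    Table and column names are hypothetical placeholders.
    """
    # pgvector accepts a text literal such as '[0.1,0.2,0.3]' cast to vector.
    vec_literal = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"
    sql = (
        "SELECT d.id, d.content, "
        "1 - (d.embedding <=> %s::vector) AS similarity "
        "FROM documents d "
        "JOIN categories c ON c.id = d.category_id "
        "WHERE c.name = %s "
        "ORDER BY d.embedding <=> %s::vector "
        "LIMIT %s"
    )
    return sql, (vec_literal, category, vec_literal, k)


sql, params = build_filtered_search("evaluation", [0.1, 0.2, 0.3], k=3)
print(sql)
print(params)
```

With a real connection, the pair would be passed to something like `cursor.execute(sql, params)`. Doing the same with a separate vector service would typically mean one call for the search and another for the relational filter or join.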
When to consider a dedicated vector DB:
- More than 10M documents: Pinecone, Weaviate.
- Multi-modal search (text + image): Weaviate, Qdrant.
- Managed cloud with SLA: Pinecone.
- On-premise, full control: Qdrant.
For most enterprise RAG applications at typical document volumes, pgvector is the right starting point. Migrate when you hit actual limits, not anticipated ones.
Decision 2: 768 Dimensions instead of 3072
gemini-embedding-001 outputs 3072 dimensions by default. We set output_dimensionality=768. The constraint: pgvector’s HNSW index supports at most 2,000 dimensions for the vector type, so the full 3072-dimension output cannot be indexed as-is.
Why not 2000? We chose 768 because:
- It’s a well-established embedding size used by BERT and many production systems.
- Cosine similarity quality degrades only slightly versus the full 3072 dims for typical retrieval tasks.
- Smaller vectors mean faster index builds and lower storage cost.
Dimension vs. quality trade-off:
| Dimensions | Index build | Storage | Retrieval quality |
|---|---|---|---|
| 256 | Fastest | Smallest | Noticeably lower |
| 768 | Fast | Small | Near full quality |
| 1536 | Moderate | Moderate | Full quality |
| 3072 | Slow | Largest | Full quality (no HNSW) |
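To put rough numbers on the Storage column, here is a back-of-the-envelope sketch. It assumes pgvector's documented size for a single-precision `vector` value (4 bytes per dimension plus 8 bytes of overhead) and ignores row and index overhead, so treat the results as lower bounds rather than measurements.

```python
def vector_bytes(dims):
    """Approximate size of one pgvector `vector` value."""
    return 4 * dims + 8  # 4-byte floats plus a small fixed header


def corpus_gigabytes(dims, n_docs):
    """Raw vector storage for n_docs embeddings, excluding index overhead."""
    return vector_bytes(dims) * n_docs / 1e9


for dims in (256, 768, 1536, 3072):
    print(
        f"{dims:>5} dims: {vector_bytes(dims):>6} bytes/vector, "
        f"{corpus_gigabytes(dims, 1_000_000):.2f} GB per 1M docs"
    )
```

Under these assumptions, one million 768-dimension vectors need about 3 GB of raw storage, versus about 12 GB at 3072 dimensions.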
Decision 3: Asymmetric task_type
We used different task_type values for ingestion and querying:
```python
# Ingestion
config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT", ...)
# Query
config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY", ...)
```
Why this matters: Gemini’s embedding model is trained with asymmetric objectives. A document and a query about the same topic are represented differently in embedding space: the model learns to map queries toward relevant documents, not to the same point. Using the same task type for both can degrade retrieval accuracy.
This is analogous to how you’d phrase a document differently from a search query in natural language: “F1 Score is the harmonic mean of Precision and Recall” (document) vs. “how to calculate F1” (query).
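One way to keep the two task types from being mixed up is to wrap them in separate helpers. The sketch below assumes `client` is a `genai.Client` from the google-genai SDK and passes the config as a plain dict with the same field names as `EmbedContentConfig`; the helper names (`embed`, `embed_document`, `embed_query`) are hypothetical.

```python
EMBED_MODEL = "gemini-embedding-001"


def embed(client, text, task_type, dims=768):
    """Embed one text with an explicit task type and return its vector."""
    response = client.models.embed_content(
        model=EMBED_MODEL,
        contents=text,
        # Dict form of the EmbedContentConfig fields used in this article.
        config={"task_type": task_type, "output_dimensionality": dims},
    )
    return response.embeddings[0].values


def embed_document(client, text):
    """Use at ingestion time."""
    return embed(client, text, "RETRIEVAL_DOCUMENT")


def embed_query(client, text):
    """Use at query time."""
    return embed(client, text, "RETRIEVAL_QUERY")
```

Callers then choose a function by role (document or query) and never pass the task type string by hand, which makes an accidental symmetric setup harder to write.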
Decision 4: HNSW over IVFFlat
pgvector supports two index types. We chose HNSW.
| | HNSW | IVFFlat |
|---|---|---|
| Query speed | Fast | Moderate |
| Build time | Moderate | Fast |
| Memory | Higher | Lower |
| Accuracy at scale | Higher | Lower |
| Requires training data | No | Yes (build the index after the table has data) |
HNSW is the better default for production. IVFFlat is worth considering only when you have very tight memory constraints and can afford slower queries.
HNSW parameter guide:
- m = 16: max connections per node. Typical range: 4–64. The default of 16 works for most cases.
- ef_construction = 64: search width during build. Typical range: 16–512. The default of 64 is a good production starting point.
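For reference, these parameters are set in the index DDL. The sketch below generates that statement as a string; the table and column names are hypothetical, and the identifiers are interpolated directly, so it is only suitable for trusted, hard-coded names. It also shows `hnsw.ef_search`, pgvector's query-time counterpart to `ef_construction`.

```python
def hnsw_index_sql(table, column, m=16, ef_construction=64):
    """Build CREATE INDEX DDL for a cosine-distance HNSW index."""
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {int(m)}, ef_construction = {int(ef_construction)});"
    )


def ef_search_sql(ef_search=40):
    """Query-time candidate list size; higher values trade speed for recall."""
    return f"SET hnsw.ef_search = {int(ef_search)};"


print(hnsw_index_sql("documents", "embedding"))
print(ef_search_sql(100))
```

Raising `m` or `ef_construction` generally improves recall at the cost of build time and memory, so it is worth changing them only after measuring recall on your own data.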
Decision 5: Gemini 2.5 Flash for Generation
We used gemini-2.5-flash rather than a more capable model such as gemini-2.5-pro. Reasoning:
- Flash has sufficient quality for document-grounded Q&A — the retrieval step does the heavy lifting.
- Flash is faster and cheaper (or free-tier eligible during development).
- The generation prompt is constrained: “answer based on these documents” limits hallucination regardless of model capability.
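The constrained prompt in the last point might look like the following sketch. The exact wording and the function name are illustrative assumptions, not the prompt from the previous article.

```python
def build_grounded_prompt(question, documents):
    """Assemble a prompt that restricts the answer to retrieved documents."""
    numbered = "\n\n".join(
        f"[{i}] {doc.strip()}" for i, doc in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the documents below. "
        "If they do not contain the answer, say you do not know.\n\n"
        f"Documents:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "How is F1 calculated?",
    ["F1 Score is the harmonic mean of Precision and Recall."],
)
print(prompt)
```

Numbering the documents also makes it possible to ask the model to cite which one it used, which helps when checking faithfulness later.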
When to upgrade the generation model:
- Complex multi-step reasoning across many documents.
- Synthesis tasks requiring cross-document inference.
- When evaluation scores (Faithfulness, Relevancy) are consistently below threshold.
When to upgrade the embedding model:
- Low Context Recall — the right documents aren’t being retrieved.
- Evaluation reveals semantic mismatch between queries and stored documents.
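A rough way to spot the first signal is to measure recall@k on a small hand-labeled set. The sketch below is a simplified, ID-based proxy and not the Context Recall metric computed by evaluation frameworks; it assumes you know which document IDs are relevant for each test question.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of known-relevant documents found in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(examples, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in examples]
    return sum(scores) / len(scores)


examples = [
    (["d1", "d7", "d3"], {"d1", "d3"}),  # both relevant docs retrieved
    (["d9", "d2", "d5"], {"d4"}),        # relevant doc missed
]
print(mean_recall_at_k(examples, k=3))
```

If this number stays low while the generation prompt and model are unchanged, the bottleneck is more likely in embedding or chunking than in answer generation.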
The embedding model matters more for retrieval quality. The generation model matters more for answer quality. Optimize them independently.
The Scaling Path
This architecture scales predictably:
- Phase 1 (now): pgvector local → works to ~1M docs.
- Phase 2: pgvector + Supabase → managed PostgreSQL, easy scaling.
- Phase 3: pgvector + read replicas → horizontal query scaling.
- Phase 4: Dedicated vector DB → if you genuinely outgrow pgvector.
Most teams never reach Phase 4. Start at Phase 1, move when you have evidence you need to.
Common Pitfalls
- Chunking strategy matters more than model choice. If your documents are long (PDFs, reports), how you split them into chunks dramatically affects retrieval quality. A naive split at 512 tokens often breaks context mid-sentence. Consider semantic chunking or overlap.
- Don’t embed the question alone. For complex questions, consider HyDE (Hypothetical Document Embedding): generate a hypothetical answer to the question, embed that, then search. This often retrieves better documents than embedding the raw question.
- Reranking improves precision. After vector search returns top-k candidates, a cross-encoder reranker (like Cohere Rerank) re-scores them for precision. Add this when recall is good but final answer quality is inconsistent.
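For the first pitfall, overlap can be sketched as a sliding window. This minimal version counts whitespace-separated words as a stand-in for tokens (an assumption; a real pipeline would use the model's tokenizer), and the default sizes are arbitrary.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into overlapping windows of whitespace-separated words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_words("a b c d e f g", size=4, overlap=2))
```

Each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is longer than the sentence. Semantic chunking goes further by splitting on paragraph or section boundaries.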
In the next article, we’ll give the LLM the ability to call these search functions autonomously using Tool Use.