Building a RAG System from Scratch — Design Decisions Explained

In the previous article, we built a working RAG pipeline. Now let’s step back and ask why we made each design decision, and what alternatives exist when your requirements change.

The Full Picture

Here’s what we built:

  • Ingest phase: Text → gemini-embedding-001 (RETRIEVAL_DOCUMENT, 768 dims) → pgvector (HNSW index, cosine similarity)

  • Query phase: Question → gemini-embedding-001 (RETRIEVAL_QUERY, 768 dims) → pgvector search (top-k) → Gemini 2.5 Flash (answer generation)

Every element in this diagram was a choice. Let’s examine each one.
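
To make the two phases concrete, here is a minimal offline sketch of the same flow. It is not the code from the previous article: `InMemoryStore` is a hypothetical brute-force stand-in for the pgvector table, and `embed` is an injected function (assumed signature: text and task type in, vector out) standing in for the gemini-embedding-001 call. The generation step is left out.

```python
import math
from typing import Callable, List, Tuple

# Assumed signature for illustration: (text, task_type) -> embedding vector.
EmbedFn = Callable[[str, str], List[float]]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryStore:
    """Hypothetical stand-in for the pgvector table: exact search, no index."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, List[float]]] = []

    def add(self, text: str, vector: List[float]) -> None:
        self.rows.append((text, vector))

    def top_k(self, query_vector: List[float], k: int) -> List[str]:
        # Rank every stored row by similarity to the query, highest first.
        ranked = sorted(
            self.rows,
            key=lambda row: cosine_similarity(query_vector, row[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


def ingest(texts: List[str], embed: EmbedFn, store: InMemoryStore) -> None:
    """Ingest phase: embed each text as a document and store it."""
    for text in texts:
        store.add(text, embed(text, "RETRIEVAL_DOCUMENT"))


def retrieve(question: str, embed: EmbedFn, store: InMemoryStore, k: int = 3) -> List[str]:
    """Query phase (retrieval half): embed the question as a query, then search."""
    return store.top_k(embed(question, "RETRIEVAL_QUERY"), k)
```

The texts returned by `retrieve` are what would be passed to the generation model as context.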


Decision 1: pgvector over a Dedicated Vector DB

We used pgvector, a PostgreSQL extension, rather than a purpose-built vector database like Pinecone, Weaviate, or Qdrant.

Why pgvector works here:

  • Integrates with existing PostgreSQL infrastructure — no new service to operate.
  • SQL and vector search in the same query: filter by category, join with other tables, all in one round-trip.
  • Handles millions of documents comfortably with HNSW indexing.
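
As an illustration of the "same query" point, the sketch below builds one parameterized SQL statement that filters by category, joins another table, and orders by cosine distance using pgvector's `<=>` operator. The table and column names (`documents`, `authors`, `category`, `author_id`) are hypothetical, and the `%s` placeholders assume a psycopg-style driver.

```python
from typing import List, Tuple


def to_pgvector_literal(vector: List[float]) -> str:
    """Format a Python list as pgvector text input, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


def build_filtered_search(
    query_vector: List[float], category: str, k: int = 5
) -> Tuple[str, tuple]:
    """Return (sql, params) for a filtered, joined top-k similarity search.

    Hypothetical schema: documents(id, content, category, author_id, embedding)
    and authors(id, name).
    """
    sql = (
        "SELECT d.id, d.content, a.name AS author, "
        "d.embedding <=> %s::vector AS cosine_distance "
        "FROM documents AS d "
        "JOIN authors AS a ON a.id = d.author_id "
        "WHERE d.category = %s "
        "ORDER BY d.embedding <=> %s::vector "  # <=> is cosine distance in pgvector
        "LIMIT %s"
    )
    literal = to_pgvector_literal(query_vector)
    return sql, (literal, category, literal, k)
```

With a driver such as psycopg, the pair would typically be passed straight to `cursor.execute(sql, params)`, so the filter, join, and vector ranking all happen in one round-trip.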

When to consider a dedicated vector DB:

  • More than 10M documents: Pinecone, Weaviate.
  • Multi-modal search (text + image): Weaviate, Qdrant.
  • Managed cloud with SLA: Pinecone.
  • On-premise, full control: Qdrant.

For most enterprise RAG applications at typical document volumes, pgvector is the right starting point. Migrate when you hit actual limits, not anticipated ones.


Decision 2: 768 Dimensions instead of 3072

gemini-embedding-001 outputs 3072 dimensions by default. We set output_dimensionality=768. The constraint: pgvector’s HNSW index supports at most 2000 dimensions for the standard vector type, so the full 3072-dimension output cannot be indexed as-is.

Why not 2000? We chose 768 because:

  • It’s a well-established embedding size used by BERT and many production systems.
  • Cosine similarity quality degrades only slightly versus the full 3072 dims for typical retrieval tasks.
  • Smaller vectors mean faster index builds and lower storage cost.
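
One practical detail when truncating: to my understanding of the Gemini embedding documentation, only the full 3072-dimension output is returned unit-normalized, so shorter outputs should be re-normalized before using dot-product or Euclidean distance. Cosine similarity, which this pipeline uses, ignores vector length and is unaffected. The sketch below illustrates both points with plain Python lists; the function names are illustrative.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return [0.0 for _ in vector]
    return [x / norm for x in vector]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; depends only on direction, not on vector length."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)
```

For unit-length vectors, `dot(a, b)` and `cosine_similarity(a, b)` give the same value, which is why normalization only matters if you switch away from cosine distance.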

Dimension vs. quality trade-off:

Dimensions   Index build   Storage    Retrieval quality
256          Fastest       Smallest   Noticeably lower
768          Fast          Small      Near full quality
1536         Moderate      Moderate   Full quality
3072         Slow          Largest    Full quality (no HNSW)
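
The storage column can be made concrete with a little arithmetic. The pgvector README gives the size of a `vector` value as 4 bytes per dimension plus 8 bytes of overhead; the sketch below applies that formula. The 1M row count is an arbitrary example, and the figures exclude index size and other row overhead.

```python
def vector_bytes(dimensions: int) -> int:
    """Storage for one pgvector `vector` value: 4 bytes per dimension + 8 bytes overhead."""
    if dimensions <= 0:
        raise ValueError("dimensions must be positive")
    return 4 * dimensions + 8


def table_gib(dimensions: int, rows: int) -> float:
    """Raw vector storage for `rows` vectors, in GiB (excludes indexes and row overhead)."""
    return vector_bytes(dimensions) * rows / 2**30


if __name__ == "__main__":
    for dims in (256, 768, 1536, 3072):
        print(
            f"{dims:>5} dims: {vector_bytes(dims):>6} bytes/vector, "
            f"{table_gib(dims, 1_000_000):.2f} GiB per 1M vectors"
        )
```

At 768 dimensions each vector takes 3080 bytes, versus 12296 bytes at 3072 dimensions, so the raw vector storage is roughly four times smaller.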

Decision 3: Asymmetric task_type

We used different task_type values for ingestion and querying:

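from google.genai import types

# Ingestion: embed stored documents with the document task type
doc_config = types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT", output_dimensionality=768)
# Query: embed the user's question with the query task type
query_config = types.EmbedContentConfig(task_type="RETRIEVAL_QUERY", output_dimensionality=768)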

Why this matters: Gemini’s embedding model is trained with asymmetric objectives. A document and a query about the same topic are represented differently in embedding space: the model learns to map queries toward relevant documents, not to the same point. Using the same task type for both degrades retrieval accuracy.

This is analogous to how you’d phrase a document differently from a search query in natural language: “F1 Score is the harmonic mean of Precision and Recall” (document) vs. “how to calculate F1” (query).


Decision 4: HNSW over IVFFlat

pgvector supports two index types. We chose HNSW.

                         HNSW       IVFFlat
Query speed              Fast       Moderate
Build time               Moderate   Fast
Memory                   Higher     Lower
Accuracy at scale        Higher     Lower
Requires training data   No         Yes (build after loading data; reindex after large changes)

HNSW is the better default for production. IVFFlat is worth considering only when you have very tight memory constraints and can afford slower queries.

HNSW parameter guide:

  • m = 16: max connections per node. Range: 4–64. Default 16 works for most cases.
  • ef_construction = 64: search width during build. Range: 16–512. Default 64 is a good production starting point.
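
These two parameters are set when the index is created. The helper below is a small sketch that renders the `CREATE INDEX` statement in the form shown in the pgvector README; the helper itself and its validation are illustrative. The `ef_construction >= 2 * m` check reflects my understanding of a constraint pgvector enforces, so treat it as an assumption.

```python
def hnsw_index_ddl(
    table: str, column: str, m: int = 16, ef_construction: int = 64
) -> str:
    """Render a pgvector HNSW index statement using cosine distance."""
    # Identifiers are interpolated into SQL, so reject anything unusual.
    for name in (table, column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    # Assumption: pgvector requires ef_construction to be at least 2 * m.
    if ef_construction < 2 * m:
        raise ValueError("ef_construction should be at least 2 * m")
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )
```

Calling `hnsw_index_ddl("items", "embedding")` yields the statement with the defaults discussed above. Raising either value generally trades longer build time and more memory for better recall.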

Decision 5: Gemini 2.5 Flash for Generation

We used gemini-2.5-flash rather than the more capable gemini-2.5-pro. Reasoning:

  • Flash has sufficient quality for document-grounded Q&A — the retrieval step does the heavy lifting.
  • Flash is faster and cheaper (or free-tier eligible during development).
  • The generation prompt is constrained: “answer based on these documents” limits hallucination regardless of model capability.
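
A constrained prompt of this kind might look like the sketch below. The exact wording and the numbered-document layout are assumptions for illustration, not the prompt from the previous article.

```python
from typing import List


def build_grounded_prompt(question: str, documents: List[str]) -> str:
    """Assemble a prompt that asks the model to answer only from the given documents."""
    numbered = "\n\n".join(
        f"[{i}] {doc.strip()}" for i, doc in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the documents below. "
        "If the documents do not contain the answer, say you do not know.\n\n"
        f"Documents:\n{numbered}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )
```

The resulting string would be sent to the generation model together with the top-k documents returned by the vector search. Numbering the documents also makes it easier to ask the model to cite which one it used.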

When to upgrade the generation model:

  • Complex multi-step reasoning across many documents.
  • Synthesis tasks requiring cross-document inference.
  • When evaluation scores (Faithfulness, Relevancy) are consistently below threshold.

When to upgrade the embedding model:

  • Low Context Recall — the right documents aren’t being retrieved.
  • Evaluation reveals semantic mismatch between queries and stored documents.

The embedding model matters more for retrieval quality. The generation model matters more for answer quality. Optimize them independently.
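
One way to turn this into a routine check is a small triage function over your evaluation scores. This is only a sketch: the `EvalScores` structure, the single shared threshold, and its 0.8 default are hypothetical, and the metric names simply mirror the ones mentioned above.

```python
from dataclasses import dataclass


@dataclass
class EvalScores:
    """Hypothetical container for averaged evaluation metrics in [0, 1]."""
    context_recall: float
    faithfulness: float
    relevancy: float


def suggest_upgrade(scores: EvalScores, threshold: float = 0.8) -> str:
    """Return which component to look at first: 'embedding', 'generation', or 'none'."""
    # Retrieval is checked first: a stronger generator cannot use
    # documents that were never retrieved.
    if scores.context_recall < threshold:
        return "embedding"
    if scores.faithfulness < threshold or scores.relevancy < threshold:
        return "generation"
    return "none"
```

In practice each metric would likely get its own threshold, tuned on your own evaluation set.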


The Scaling Path

This architecture scales predictably:

  • Phase 1 (now): pgvector local → works to ~1M docs.
  • Phase 2: pgvector + Supabase → managed PostgreSQL, easy scaling.
  • Phase 3: pgvector + read replicas → horizontal query scaling.
  • Phase 4: Dedicated vector DB → if you genuinely outgrow pgvector.

Most teams never reach Phase 4. Start at Phase 1, move when you have evidence you need to.


Common Pitfalls

  • Chunking strategy matters more than model choice. If your documents are long (PDFs, reports), how you split them into chunks dramatically affects retrieval quality. A naive split at 512 tokens often breaks context mid-sentence. Consider semantic chunking or overlap.
  • Don’t embed the question alone. For complex questions, consider HyDE (Hypothetical Document Embedding): generate a hypothetical answer to the question, embed that, then search. This often retrieves better documents than embedding the raw question.
  • Reranking improves precision. After vector search returns top-k candidates, a cross-encoder reranker (like Cohere Rerank) re-scores them for precision. Add this when recall is good but final answer quality is inconsistent.
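
Two of these pitfalls are easy to sketch in code. The first function below splits a token list into overlapping chunks; it assumes the text is already tokenized (whitespace-split words are a rough stand-in for model tokens), and the 512/64 defaults are illustrative. The second shows the shape of HyDE with injected, hypothetical `generate`, `embed`, and `search` callables, so no particular SDK is implied.

```python
from typing import Callable, List, Sequence


def chunk_with_overlap(
    tokens: Sequence[str], chunk_size: int = 512, overlap: int = 64
) -> List[List[str]]:
    """Split tokens into chunks where each chunk repeats the last `overlap` tokens of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


def hyde_retrieve(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], List[float]],
    search: Callable[[List[float], int], List[str]],
    k: int = 5,
) -> List[str]:
    """HyDE: search with the embedding of a generated hypothetical answer, not the raw question."""
    hypothetical = generate(f"Write a short passage that answers: {question}")
    return search(embed(hypothetical), k)
```

Overlap does not stop a chunk from ending mid-sentence, but it makes it more likely that each sentence appears whole in at least one chunk. HyDE adds one generation call per query, so it is worth measuring whether the retrieval gain justifies the extra latency.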

In the next article, we’ll give the LLM the ability to call these search functions autonomously using Tool Use.