Building a RAG System from Scratch — Wrap-up and What Comes Next
Building a RAG System from Scratch — Wrap-up and What Comes Next
从零构建 RAG 系统 — 总结与展望
In this final article, we’ll recap what we built across the series, consolidate the design decisions, and point to where to go next. 在这一系列的最后一篇文章中,我们将回顾整个系列所构建的内容,梳理设计决策,并指明接下来的学习方向。
What We Built
我们构建了什么
Starting from a blank Python project, we built a complete AI system step by step: 从一个空白的 Python 项目开始,我们一步步构建了一个完整的 AI 系统:
- 01_setup_db.py: pgvector table + extension
- 02_create_index.py: HNSW index (m=16, ef_construction=64)
- 03_ingest.py: Embed documents → store in pgvector
- 04_search.py: Cosine similarity search
- 05_rag.py: Full RAG pipeline
- 06_tool_basic.py: LLM decides whether to search
- 07_tool_multi.py: LLM routes between multiple tools
- 08_tool_agent.py: Multi-step agentic loop
- 09_agent_basic.py: ReAct pattern
- 10_agent_memory.py: Persistent memory across sessions
- 11_agent_planner.py: Plan → Execute → Evaluate
- mcp_server/server.py: MCP server (stdio, Claude Desktop)
- server_http.py: MCP server (HTTP)
- server_render.py: MCP server (Render deployment)
- 12_mcp_agent.py: Agent via MCP (local)
- 13_mcp_http_agent.py: Agent via MCP (cloud)
Design Decisions at a Glance
设计决策概览
pgvector over a dedicated vector DB 选择 pgvector 而非专用向量数据库 pgvector integrates with existing PostgreSQL, supports SQL + vector in one query, and handles millions of documents comfortably. Start here and migrate only when you have evidence you need to. pgvector 与现有的 PostgreSQL 集成,支持在单个查询中同时使用 SQL 和向量,并能轻松处理数百万份文档。从这里开始,只有当你确实有证据表明需要迁移时,再考虑更换。
768 dimensions 768 维度 gemini-embedding-001 outputs 3072 dims by default, but pgvector’s HNSW index has a 2000-dim hard limit. 768 dims stays well within bounds with negligible quality loss. gemini-embedding-001 默认输出 3072 维,但 pgvector 的 HNSW 索引有 2000 维的硬性限制。768 维完全在限制范围内,且质量损失微乎其微。
Asymmetric task_type
非对称 task_type
Use RETRIEVAL_DOCUMENT when storing, RETRIEVAL_QUERY when searching. The Gemini embedding model is trained to map queries toward documents, not to the same point. Using the same task type for both degrades retrieval accuracy.
存储时使用 RETRIEVAL_DOCUMENT,搜索时使用 RETRIEVAL_QUERY。Gemini 嵌入模型经过训练,旨在将查询映射到文档方向,而非映射到同一点。两者使用相同的任务类型会降低检索准确性。
HNSW over IVFFlat 选择 HNSW 而非 IVFFlat HNSW requires no training data, delivers consistent recall at scale, and is faster at query time. IVFFlat is only worth considering under tight memory constraints. HNSW 不需要训练数据,在大规模下能提供一致的召回率,且查询速度更快。只有在内存极其受限的情况下,才值得考虑 IVFFlat。
Tool description is routing logic 工具描述即路由逻辑 The LLM selects tools based on their description field. Precise, distinguishing descriptions produce correct tool selection. Vague descriptions produce random behavior. LLM 根据工具的描述字段来选择工具。精确且具有区分度的描述能带来正确的工具选择,而模糊的描述会导致随机行为。
The conversation history is the agent’s memory 对话历史即代理的记忆 Each tool call and result gets appended to contents. The LLM reads the full history on every step — this is how multi-step reasoning works. 每次工具调用及其结果都会被追加到内容中。LLM 在每一步都会读取完整的历史记录——这就是多步推理的工作原理。
MCP makes tools infrastructure MCP 让工具成为基础设施 MCP turns hardcoded functions into a standalone server. Claude Desktop, Gemini agents, and any future client can connect to the same server without duplicating tool definitions. MCP 将硬编码的函数转变为独立的服务器。Claude Desktop、Gemini 代理以及任何未来的客户端都可以连接到同一个服务器,而无需重复定义工具。
Render + Supabase for zero-cost cloud deployment 使用 Render + Supabase 实现零成本云部署 Render’s free web service hosts the MCP server. Supabase’s free tier hosts pgvector. The Connection Pooler (port 6543) is mandatory — Render doesn’t support the IPv6 used by Supabase’s standard port 5432. Render 的免费 Web 服务托管 MCP 服务器,Supabase 的免费层托管 pgvector。连接池(端口 6543)是必须的——因为 Render 不支持 Supabase 标准端口 5432 所使用的 IPv6。
The Architecture We Ended Up With
我们最终的架构
Local: Claude Desktop ↓ stdio mcp_server/server.py ↓ psycopg2 pgvector (Docker)
Cloud: Python agent (13_mcp_http_agent.py) ↓ HTTPS Render (server_render.py) ↓ PostgreSQL + SSL (port 6543) Supabase (pgvector) ↓ Gemini Embedding + LLM
What This Series Did Not Cover
本系列未涵盖的内容
This series focused on getting a production-ready RAG system off the ground. Several important topics are out of scope here: 本系列专注于让一个生产就绪的 RAG 系统落地。以下几个重要主题超出了本文范围:
- Evaluation (Evals) — How do you know if your RAG is actually working? You need automated quality measurement: Context Recall, Answer Relevancy, and Faithfulness scoring. 评估 (Evals) — 如何确定你的 RAG 是否真的有效?你需要自动化的质量衡量:上下文召回率、答案相关性和忠实度评分。
- Observability — When something goes wrong in production, how do you debug it? Tracing each step with a tool like Langfuse tells you exactly where latency or quality issues originate. 可观测性 — 当生产环境出现问题时,如何调试?使用 Langfuse 等工具追踪每一步,可以准确指出延迟或质量问题的根源。
- Security — How do you handle adversarial inputs? Prompt injection, jailbreaks, and PII leakage are real threats in any public-facing RAG system. 安全性 — 如何处理对抗性输入?提示词注入、越狱和个人隐私信息(PII)泄露是任何面向公众的 RAG 系统面临的真实威胁。
- MLOps / LLMOps — How do you ship changes safely? Prompt versioning, CI/CD quality gates, and API cost tracking become essential when the system is in production. MLOps / LLMOps — 如何安全地发布变更?当系统进入生产环境时,提示词版本控制、CI/CD 质量门禁和 API 成本追踪变得至关重要。
- Fine-tuning — When the base model doesn’t behave the way you need, LoRA fine-tuning lets you adapt it to your domain with surprisingly little data and compute. 微调 — 当基础模型表现不符合需求时,LoRA 微调能让你以极少的数据和计算资源将其适配到你的领域。
- Multi-Agent Systems — When a single agent isn’t enough, orchestrator-worker patterns distribute work across specialized agents. 多代理系统 — 当单个代理不足以胜任时,编排者-工作者(orchestrator-worker)模式可以将工作分配给专门的代理。
- Governance — The EU AI Act is now fully in force. Compliance for a chatbot system means AI disclosure notices, audit logging, and a documented risk assessment. 治理 — 欧盟《人工智能法案》现已全面生效。聊天机器人系统的合规性意味着需要 AI 披露声明、审计日志和文档化的风险评估。
All of these are covered in Vol.2 of this series. 以上所有内容都将在本系列的第二卷中涵盖。
Vol.2: Production Operations Guide
第二卷:生产运维指南
The second series picks up where this one leaves off — taking a working RAG system and making it production-grade. 第二系列将从这里继续——将一个可用的 RAG 系统提升至生产级标准。
| Chapter | Topic |
|---|---|
| 1 | What “production” actually means |
| 2 | Evals — automated quality measurement |
| 3 | Observability with Langfuse v4 |
| 4 | Security — guardrails and prompt injection defense |
| 5 | MLOps / LLMOps — CI/CD pipeline |
| 6 | Fine-tuning with LoRA |
| 7 | Multi-Agent: orchestrator-worker pattern |
| 8 | Governance — EU AI Act compliance |
| 9 | Wrap-up |
| 章节 | 主题 |
|---|---|
| 1 | 什么是真正的“生产环境” |
| 2 | 评估 — 自动化质量衡量 |
| 3 | 使用 Langfuse v4 进行可观测性监控 |
| 4 | 安全性 — 防护栏与提示词注入防御 |
| 5 | MLOps / LLMOps — CI/CD 流水线 |
| 6 | 使用 LoRA 进行微调 |
| 7 | 多代理:编排者-工作者模式 |
| 8 | 治理 — 欧盟《人工智能法案》合规 |
| 9 | 总结 |
Source Code
源代码
Everything built in this series is in one repository: github.com/qameqame/pgvector-tutorial 本系列构建的所有内容都在一个仓库中:github.com/qameqame/pgvector-tutorial
The README covers setup, directory structure, and the reasoning behind each design decision. README 文件涵盖了设置、目录结构以及每个设计决策背后的逻辑。
Series Index
系列索引
-
Introduction
-
RAG · Embedding · Vector DB Implementation
-
Design Decisions Explained
-
Tool Use — Autonomous Search
-
AI Agents — Memory and Planning
-
MCP — Reusable Tool Server
-
Cloud Deployment — Render × Supabase
-
Wrap-up and Next Steps (this article)
-
简介
-
RAG · 嵌入 · 向量数据库实现
-
设计决策详解
-
工具使用 — 自主搜索
-
AI 代理 — 记忆与规划
-
MCP — 可复用的工具服务器
-
云部署 — Render × Supabase
-
总结与展望(本文)
Thanks for following along. If you found this useful, the GitHub repo and Vol.2 are the best places to continue. 感谢您的阅读。如果您觉得本系列有用,GitHub 仓库和第二卷将是您继续学习的最佳去处。