Building a RAG System from Scratch with pgvector and Gemini — Implementation
In the previous article, we covered the three core concepts behind RAG. Now let’s build it. By the end of this article you’ll have a working RAG pipeline: documents stored as vectors in pgvector, semantic search retrieving the right context, and Gemini generating grounded answers.
Environment Setup
Prerequisites
- Python 3.12 (pyenv recommended)
- Docker
- Google Gemini API key — get one free at aistudio.google.com
Project setup
mkdir pgvector-tutorial && cd pgvector-tutorial
pyenv local 3.12.0
python -m venv .venv
source .venv/bin/activate
pip install psycopg2-binary google-genai python-dotenv
pip freeze > requirements.txt
Use google-genai (new package), not google-generativeai (deprecated).
Start pgvector with Docker
docker run -d \
  --name pgvector-demo \
  -e POSTGRES_PASSWORD=password \
  -e POSTGRES_DB=vectordb \
  -p 5432:5432 \
  pgvector/pgvector:pg16
.env file
GEMINI_API_KEY=AIza...
DB_HOST=localhost
DB_PORT=5432
DB_NAME=vectordb
DB_USER=postgres
DB_PASSWORD=password
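A missing or misspelled key in .env tends to surface later as a confusing connection error. The sketch below fails early instead. `REQUIRED_KEYS` and `load_db_config` are hypothetical names that do not appear in the scripts in this article, and the function reads from `os.environ`, so it assumes `load_dotenv()` has already been called.

```python
import os

# Database keys the scripts in this article read from .env.
REQUIRED_KEYS = ("DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASSWORD")


def load_db_config(env=None) -> dict:
    """Return keyword arguments for psycopg2.connect(), or raise if a key is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing settings in .env: {', '.join(missing)}")
    return {
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),  # fail here if the port is not a number
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
    }
```

With this in place, each script could open its connection with `psycopg2.connect(**load_db_config())` instead of repeating the five `os.getenv` calls.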
Directory Structure
We’ll build these five files in order:
pgvector-tutorial/
├── 01_setup_db.py # Create table + enable pgvector
├── 02_create_index.py # HNSW index
├── 03_ingest.py # Embed documents and store
├── 04_search.py # Vector search
└── 05_rag.py # Full RAG pipeline
Step 1: Database Setup — 01_setup_db.py
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()

conn = psycopg2.connect(
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)
cur = conn.cursor()

# Enable the pgvector extension (a no-op if it is already enabled).
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")

# vector(768) must match the output_dimensionality used when embedding.
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id SERIAL PRIMARY KEY,
        title TEXT NOT NULL,
        body TEXT NOT NULL,
        category TEXT,
        created_at TIMESTAMP DEFAULT NOW(),
        embedding vector(768)
    );
""")
conn.commit()
print("Table created.")

cur.close()
conn.close()
python 01_setup_db.py
Why 768 dimensions?
gemini-embedding-001 outputs 3072 dimensions by default, but pgvector can only build an HNSW index on a vector column of up to 2,000 dimensions. Setting output_dimensionality=768 keeps us well within that limit with negligible quality loss, and it must match the vector(768) column defined in the table.
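To make the trade-off concrete, the arithmetic can be sketched as follows. The storage formula (4 bytes per dimension plus 8 bytes per vector) comes from the pgvector documentation; the function names are hypothetical, and the 2,000-dimension figure applies to the `vector` type used in this article.

```python
HNSW_MAX_DIMS = 2000  # indexable limit for pgvector's `vector` type


def vector_storage_bytes(dims: int) -> int:
    """Approximate storage for one pgvector `vector` value."""
    return 4 * dims + 8


def fits_hnsw(dims: int) -> bool:
    """True if a vector(dims) column can carry an HNSW index."""
    return dims <= HNSW_MAX_DIMS


for dims in (768, 3072):
    print(dims, vector_storage_bytes(dims), fits_hnsw(dims))
```

At 768 dimensions each embedding takes roughly a quarter of the space of the 3072-dimension default, in addition to being indexable.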
Step 2: HNSW Index — 02_create_index.py
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()

conn = psycopg2.connect(...)  # (Same connection logic as above)
cur = conn.cursor()

# vector_cosine_ops pairs with the <=> (cosine distance) operator used in 04_search.py.
cur.execute("""
    CREATE INDEX IF NOT EXISTS docs_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
""")
conn.commit()
print("Index created.")
python 02_create_index.py
HNSW parameter reference (m = 16 and ef_construction = 64 are pgvector's defaults):
| Use case | m | ef_construction |
|---|---|---|
| Dev / testing | 8 | 32 |
| Production (standard) | 16 | 64 |
| High accuracy | 32 | 128 |
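One way to keep these presets in a single place is to generate the CREATE INDEX statement from them, as in the sketch below. `HNSW_PRESETS` and `build_hnsw_index_sql` are hypothetical names; the statement itself mirrors the one in 02_create_index.py. The values are integers from a fixed dictionary, which is the only reason formatting them into the SQL string is acceptable here.

```python
# Presets copied from the parameter table; the keys are illustrative names.
HNSW_PRESETS = {
    "dev": {"m": 8, "ef_construction": 32},
    "production": {"m": 16, "ef_construction": 64},
    "high_accuracy": {"m": 32, "ef_construction": 128},
}


def build_hnsw_index_sql(preset: str = "production") -> str:
    """Build the CREATE INDEX statement for one of the presets above."""
    params = HNSW_PRESETS[preset]
    m = int(params["m"])
    ef_construction = int(params["ef_construction"])
    return (
        "CREATE INDEX IF NOT EXISTS docs_embedding_idx "
        "ON documents USING hnsw (embedding vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )
```

Note that `CREATE INDEX IF NOT EXISTS` does nothing when an index with that name already exists, so switching presets means dropping the old index first.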
Step 3: Ingest Documents — 03_ingest.py
import os

import psycopg2
from dotenv import load_dotenv
from google import genai
from google.genai import types

load_dotenv()

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
# ... (Connection logic)


def get_embedding(text: str) -> list[float]:
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_DOCUMENT",  # use RETRIEVAL_DOCUMENT for storage
            output_dimensionality=768,
        ),
    )
    return result.embeddings[0].values


def insert_document(title: str, body: str, category: str) -> int:
    # Embed title and body together so both contribute to retrieval.
    embedding = get_embedding(f"{title}\n\n{body}")
    cur.execute("""
        INSERT INTO documents (title, body, category, embedding)
        VALUES (%s, %s, %s, %s) RETURNING id;
    """, (title, body, category, embedding))
    doc_id = cur.fetchone()[0]
    conn.commit()
    return doc_id

# ... (Sample data insertion loop)
python 03_ingest.py
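insert_document passes the embedding as a Python list, which psycopg2 sends as a PostgreSQL array and which the INSERT then relies on pgvector to cast to vector. An alternative is to send pgvector's text form, `[x1,x2,...]`, and cast it explicitly with `%s::vector`. The sketch below shows that formatting step; `to_vector_literal` is a hypothetical helper that does not appear in the original script.

```python
import math


def to_vector_literal(values: list[float], expected_dims: int = 768) -> str:
    """Format an embedding as pgvector's text representation, e.g. '[0.1,0.2]'."""
    if len(values) != expected_dims:
        raise ValueError(f"expected {expected_dims} dimensions, got {len(values)}")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("embedding contains NaN or infinity")
    return "[" + ",".join(repr(float(v)) for v in values) + "]"
```

The dimension check catches a mismatch with the vector(768) column before the row reaches the database, for example when output_dimensionality was changed in one script but not the other.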
task_type matters: Use RETRIEVAL_DOCUMENT when storing and RETRIEVAL_QUERY when searching. This asymmetric setup improves retrieval accuracy.
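Because the storage and search calls differ only in task_type, one way to keep them in sync is a single helper that takes the task type as an argument. This is a sketch, not the article's code: `embed` and `VALID_TASK_TYPES` are hypothetical names, the client is passed in rather than created at import time, and the config is given as a plain dict, which the google-genai SDK is assumed to accept in place of types.EmbedContentConfig.

```python
VALID_TASK_TYPES = {"RETRIEVAL_DOCUMENT", "RETRIEVAL_QUERY"}


def embed(client, text: str, task_type: str, dims: int = 768) -> list[float]:
    """Embed text with gemini-embedding-001 for storage or for search."""
    if task_type not in VALID_TASK_TYPES:
        raise ValueError(f"unsupported task_type: {task_type}")
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        # Same two settings as the article's scripts, expressed as a dict.
        config={"task_type": task_type, "output_dimensionality": dims},
    )
    return list(result.embeddings[0].values)
```

Ingestion would then call `embed(client, text, "RETRIEVAL_DOCUMENT")` and search would call `embed(client, query, "RETRIEVAL_QUERY")`, so the model name and dimensionality cannot drift apart between the two scripts.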
Step 4: Vector Search — 04_search.py
# ... (Setup client and connection)


def get_query_embedding(text: str) -> list[float]:
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_QUERY",  # use RETRIEVAL_QUERY for search
            output_dimensionality=768,
        ),
    )
    return result.embeddings[0].values


def search(query: str, top_k: int = 3) -> list[dict]:
    query_embedding = get_query_embedding(query)
    # <=> is cosine distance, so 1 - distance is cosine similarity.
    cur.execute("""
        SELECT id, title, category, 1 - (embedding <=> %s::vector) AS similarity
        FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s;
    """, (query_embedding, query_embedding, top_k))
    # ... (Return results)
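The query relies on `<=>` being pgvector's cosine distance operator: `1 - (embedding <=> query)` is cosine similarity, and ordering by ascending distance puts the most similar rows first. The pure-Python sketch below reproduces that arithmetic on toy two-dimensional vectors to make the relationship explicit. The function names and sample vectors are hypothetical, and no database is involved.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """The quantity pgvector's <=> computes: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank(query: list[float], docs: dict[str, list[float]], top_k: int = 3) -> list[dict]:
    """Mimic ORDER BY embedding <=> query LIMIT top_k, with similarity per row."""
    scored = [
        {"title": title, "similarity": 1.0 - cosine_distance(vec, query)}
        for title, vec in docs.items()
    ]
    # Highest similarity first is the same order as lowest distance first.
    scored.sort(key=lambda row: row["similarity"], reverse=True)
    return scored[:top_k]


docs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}
for row in rank([1.0, 0.0], docs):
    print(row)
```

Cosine distance ignores vector length, which is why the "same direction" vector ranks first even though its magnitude differs from the query's.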