Building a RAG System from Scratch with pgvector and Gemini — Implementation

In the previous article, we covered the three core concepts behind RAG. Now let’s build it. By the end of this article you’ll have a working RAG pipeline: documents stored as vectors in pgvector, semantic search that retrieves the relevant context, and Gemini generating answers grounded in that context.

Environment Setup

Prerequisites

Python 3.12 (pyenv recommended)
Docker
Google Gemini API key — get one free at aistudio.google.com

Project setup

mkdir pgvector-tutorial && cd pgvector-tutorial
pyenv local 3.12.0
python -m venv .venv
source .venv/bin/activate
pip install psycopg2-binary google-genai python-dotenv
pip freeze > requirements.txt

Use google-genai (the new package), not google-generativeai (deprecated).

Start pgvector with Docker

docker run -d \
  --name pgvector-demo \
  -e POSTGRES_PASSWORD=password \
  -e POSTGRES_DB=vectordb \
  -p 5432:5432 \
  pgvector/pgvector:pg16

.env file

GEMINI_API_KEY=AIza...
DB_HOST=localhost
DB_PORT=5432
DB_NAME=vectordb
DB_USER=postgres
DB_PASSWORD=password
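
Every script below reads these settings, so it can help to validate them in one place. The following is a minimal sketch, not part of the original tutorial files: the helper name load_db_settings and the REQUIRED_KEYS tuple are hypothetical, and it assumes load_dotenv() has already populated the environment.

```python
import os

# The DB_* keys defined in the .env file above.
REQUIRED_KEYS = ("DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASSWORD")


def load_db_settings(env=None) -> dict:
    """Collect the DB_* settings, failing early if any is missing or empty.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing settings in .env: {', '.join(missing)}")
    # Keys are renamed to the keyword arguments psycopg2.connect() expects.
    return {
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
    }
```

With this in place, a script could connect with psycopg2.connect(**load_db_settings()) after calling load_dotenv(), and a typo in .env surfaces as a clear error instead of a connection failure.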

Directory Structure

We’ll build these five files in order:

pgvector-tutorial/
├── 01_setup_db.py      # Create table + enable pgvector
├── 02_create_index.py  # HNSW index
├── 03_ingest.py        # Embed documents and store
├── 04_search.py        # Vector search
└── 05_rag.py           # Full RAG pipeline

Step 1: Database Setup — 01_setup_db.py

import psycopg2
from dotenv import load_dotenv
import os

load_dotenv()

conn = psycopg2.connect(
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)

cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id SERIAL PRIMARY KEY,
        title TEXT NOT NULL,
        body TEXT NOT NULL,
        category TEXT,
        created_at TIMESTAMP DEFAULT NOW(),
        embedding vector(768)
    );
""")
conn.commit()
print("Table created.")

python 01_setup_db.py

Why 768 dimensions? gemini-embedding-001 outputs 3072 dimensions by default, but pgvector’s HNSW index supports at most 2,000 dimensions for the vector type. Setting output_dimensionality=768 keeps us well within that limit, typically with little loss in retrieval quality. The value must match the vector(768) column declared above, because pgvector rejects vectors of any other length.
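
Since the column is fixed at 768 dimensions, a small guard before insertion can catch a mismatched output_dimensionality early. The sketch below is an illustration only: prepare_embedding and EXPECTED_DIM are hypothetical names. It also L2-normalizes the vector, on the assumption that reduced-dimension embeddings may not come back unit-length. Cosine distance ignores magnitude, so normalization is not required for the index used in this article, but it matters if you later switch to inner-product or L2 operators.

```python
import math

EXPECTED_DIM = 768  # assumed to match vector(768) in the documents table


def prepare_embedding(values: list[float], dim: int = EXPECTED_DIM) -> list[float]:
    """Check the embedding length and scale it to unit L2 norm."""
    if len(values) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(values)}")
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero embedding")
    return [v / norm for v in values]
```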

Step 2: HNSW Index — 02_create_index.py

import psycopg2
from dotenv import load_dotenv
import os

load_dotenv()

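# Same connection logic as in 01_setup_db.py
conn = psycopg2.connect(
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)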
cur = conn.cursor()

cur.execute("""
    CREATE INDEX IF NOT EXISTS docs_embedding_idx 
    ON documents USING hnsw (embedding vector_cosine_ops) 
    WITH (m = 16, ef_construction = 64);
""")
conn.commit()
print("Index created.")

python 02_create_index.py

HNSW parameter reference:

Use case	m	ef_construction
Dev / testing	8	32
Production (standard)	16	64
High accuracy	32	128
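
To switch between these settings without editing SQL by hand, the table can be expressed as a small lookup. This is a sketch under assumptions: the profile names ("dev", "production", "high_accuracy") and the function hnsw_index_sql are hypothetical, while the index name, operator class, and parameter values are taken from the script and table above.

```python
# Parameter values copied from the reference table above.
HNSW_PROFILES = {
    "dev": {"m": 8, "ef_construction": 32},
    "production": {"m": 16, "ef_construction": 64},
    "high_accuracy": {"m": 32, "ef_construction": 128},
}


def hnsw_index_sql(profile: str = "production") -> str:
    """Build the CREATE INDEX statement for one of the known profiles.

    The numbers come from the fixed dict above, never from user input,
    so formatting them into the SQL string is safe here.
    """
    params = HNSW_PROFILES[profile]  # raises KeyError for unknown profiles
    return (
        "CREATE INDEX IF NOT EXISTS docs_embedding_idx "
        "ON documents USING hnsw (embedding vector_cosine_ops) "
        f"WITH (m = {params['m']}, ef_construction = {params['ef_construction']});"
    )
```

In 02_create_index.py this could be used as cur.execute(hnsw_index_sql("dev")). Larger values of m and ef_construction generally trade slower index builds and more memory for better recall.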

Step 3: Ingest Documents — 03_ingest.py

import psycopg2
from google import genai
from google.genai import types
from dotenv import load_dotenv
import os

load_dotenv()
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
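# Same connection logic as in 01_setup_db.py
conn = psycopg2.connect(
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)
cur = conn.cursor()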

def get_embedding(text: str) -> list[float]:
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_DOCUMENT", # use RETRIEVAL_DOCUMENT for storage
            output_dimensionality=768,
        ),
    )
    return result.embeddings[0].values

def insert_document(title: str, body: str, category: str) -> int:
    embedding = get_embedding(f"{title}\n\n{body}")
    cur.execute("""
        INSERT INTO documents (title, body, category, embedding) 
        VALUES (%s, %s, %s, %s) RETURNING id;
    """, (title, body, category, embedding))
    doc_id = cur.fetchone()[0]
    conn.commit()
    return doc_id

# ... (Sample data insertion loop)

python 03_ingest.py
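
insert_document passes the embedding as a Python list, which psycopg2 adapts to a Postgres array and pgvector then converts to a vector. An alternative is to send pgvector's text form, such as [0.1,0.2,0.3], and cast it explicitly. The helper below is a minimal sketch of that approach; the name to_vector_literal is hypothetical and not part of the original script.

```python
def to_vector_literal(values: list[float]) -> str:
    """Format a list of numbers as pgvector's text representation.

    Example: [0.1, 2, -3.5] becomes "[0.1,2.0,-3.5]".
    """
    return "[" + ",".join(repr(float(v)) for v in values) + "]"
```

The INSERT would then use %s::vector for the last placeholder and pass to_vector_literal(embedding) as the parameter, which makes the type conversion explicit in the SQL.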

task_type matters: Use RETRIEVAL_DOCUMENT when storing and RETRIEVAL_QUERY when searching. The model optimizes the embedding differently for each role, and this asymmetric setup improves retrieval accuracy.
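
Because the two embedding calls differ only in task_type, they can share one function that rejects any other value. This is a sketch, not the article's code: embed and VALID_TASK_TYPES are hypothetical names, the client is passed in as an argument so the function can be exercised with a stub, and the config is given as a plain dict on the assumption that the SDK accepts a dict in place of types.EmbedContentConfig.

```python
VALID_TASK_TYPES = {"RETRIEVAL_DOCUMENT", "RETRIEVAL_QUERY"}


def embed(client, text: str, task_type: str) -> list[float]:
    """Embed `text` for storage or for search, depending on `task_type`.

    `client` is expected to behave like genai.Client: it must expose
    client.models.embed_content(model=..., contents=..., config=...).
    """
    if task_type not in VALID_TASK_TYPES:
        raise ValueError(f"unsupported task_type: {task_type!r}")
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        # Assumption: a dict is accepted here instead of EmbedContentConfig.
        config={"task_type": task_type, "output_dimensionality": 768},
    )
    return list(result.embeddings[0].values)
```

Ingestion would call embed(client, text, "RETRIEVAL_DOCUMENT") and search would call embed(client, query, "RETRIEVAL_QUERY"), which keeps the model name and dimension in a single place.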

Step 4: Vector Search — 04_search.py

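import psycopg2
from google import genai
from google.genai import types
from dotenv import load_dotenv
import os

load_dotenv()
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

# Same connection logic as in 01_setup_db.py
conn = psycopg2.connect(
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)
cur = conn.cursor()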

def get_query_embedding(text: str) -> list[float]:
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_QUERY", # use RETRIEVAL_QUERY for search
            output_dimensionality=768,
        ),
    )
    return result.embeddings[0].values

def search(query: str, top_k: int = 3) -> list[dict]:
    query_embedding = get_query_embedding(query)
    cur.execute("""
        SELECT id, title, category, 1 - (embedding <=> %s::vector) AS similarity 
        FROM documents 
        ORDER BY embedding <=> %s::vector 
        LIMIT %s;
    """, (query_embedding, query_embedding, top_k))
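    # Convert each row into a dict keyed by the selected column names
    return [
        {"id": row[0], "title": row[1], "category": row[2], "similarity": float(row[3])}
        for row in cur.fetchall()
    ]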
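
The <=> operator in the query is pgvector's cosine distance, so 1 - (embedding <=> query) is the cosine similarity. To make that arithmetic concrete, here is a pure-Python sketch of the same ranking on in-memory vectors. The function names cosine_distance and rank_by_similarity are hypothetical, and this brute-force version only illustrates the math; it does not model the approximate HNSW search.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cos(a, b): 0 for the same direction, 1 for orthogonal, 2 for opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_similarity(
    query: list[float], docs: dict[str, list[float]], top_k: int = 3
) -> list[tuple[str, float]]:
    """Mirror ORDER BY embedding <=> query LIMIT top_k, returning (title, similarity)."""
    scored = [(title, 1.0 - cosine_distance(query, vec)) for title, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:top_k]
```

Sorting by ascending distance and sorting by descending similarity give the same order, which is why the SQL orders by the raw <=> expression while reporting 1 - distance as the score.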