Building a RAG System from Scratch with pgvector and Gemini — Implementation

In the previous article, we covered the three core concepts behind RAG. Now let’s build it. By the end of this article you’ll have a working RAG pipeline: documents stored as vectors in pgvector, semantic search retrieving the right context, and Gemini generating grounded answers.

Environment Setup

Prerequisites

  • Python 3.12 (pyenv recommended)
  • Docker
  • Google Gemini API key — get one free at aistudio.google.com

Project setup

mkdir pgvector-tutorial && cd pgvector-tutorial
pyenv local 3.12.0
python -m venv .venv
source .venv/bin/activate
pip install psycopg2-binary google-genai python-dotenv
pip freeze > requirements.txt

Use google-genai (the new SDK package), not google-generativeai, which is deprecated.

Start pgvector with Docker

docker run -d \
  --name pgvector-demo \
  -e POSTGRES_PASSWORD=password \
  -e POSTGRES_DB=vectordb \
  -p 5432:5432 \
  pgvector/pgvector:pg16

.env file

GEMINI_API_KEY=AIza...
DB_HOST=localhost
DB_PORT=5432
DB_NAME=vectordb
DB_USER=postgres
DB_PASSWORD=password
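
Every script below reads the same five database settings, so it can help to validate them once. The following is a minimal sketch and not one of the tutorial's five files: the helper name load_db_config is hypothetical, and it only checks that the variables from the .env file above are present before returning keyword arguments in the shape psycopg2.connect accepts.

```python
import os

REQUIRED_KEYS = ("DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASSWORD")


def load_db_config(env=None):
    """Return psycopg2-style connection kwargs, failing early on missing settings.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return {
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),  # stored as text in .env, so convert it
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
    }
```

After load_dotenv() has run, psycopg2.connect(**load_db_config()) could stand in for the repeated os.getenv calls.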

Directory Structure

We’ll build these five files in order:

pgvector-tutorial/
├── 01_setup_db.py    # Create table + enable pgvector
├── 02_create_index.py # HNSW index
├── 03_ingest.py      # Embed documents and store
├── 04_search.py      # Vector search
└── 05_rag.py         # Full RAG pipeline

Step 1: Database Setup — 01_setup_db.py

import psycopg2
from dotenv import load_dotenv
import os

load_dotenv()

conn = psycopg2.connect(
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)

cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id SERIAL PRIMARY KEY,
        title TEXT NOT NULL,
        body TEXT NOT NULL,
        category TEXT,
        created_at TIMESTAMP DEFAULT NOW(),
        embedding vector(768)
    );
""")
conn.commit()
print("Table created.")

python 01_setup_db.py

Why 768 dimensions? gemini-embedding-001 outputs 3072 dimensions by default, but pgvector’s HNSW index supports at most 2,000 dimensions for the vector type. Setting output_dimensionality=768 keeps us well within that limit, typically with only a small loss in retrieval quality. The same value must be used for both ingestion and search so that every embedding matches the vector(768) column.
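
To make the trade-off concrete, here is a small sketch of the arithmetic. It assumes the figures given in pgvector's documentation: a vector value takes 4 bytes per dimension plus 8 bytes of overhead, and HNSW indexes on the vector type accept up to 2,000 dimensions. The function names are illustrative only.

```python
HNSW_MAX_DIMS = 2000  # pgvector limit for HNSW indexes on the `vector` type


def vector_storage_bytes(dims: int) -> int:
    """Approximate size of one `vector` value: 4 bytes per float plus 8 bytes overhead."""
    return 4 * dims + 8


def fits_hnsw(dims: int) -> bool:
    """True if a column of this dimensionality can carry an HNSW index."""
    return dims <= HNSW_MAX_DIMS


for dims in (768, 3072):
    print(dims, vector_storage_bytes(dims), fits_hnsw(dims))
# 768 3080 True
# 3072 12296 False
```

Under these assumptions a 768-dimension embedding needs about a quarter of the storage of the 3072-dimension default, in addition to being indexable.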


Step 2: HNSW Index — 02_create_index.py

import psycopg2
from dotenv import load_dotenv
import os

load_dotenv()

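# Same connection settings as 01_setup_db.py
conn = psycopg2.connect(
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)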
cur = conn.cursor()

cur.execute("""
    CREATE INDEX IF NOT EXISTS docs_embedding_idx 
    ON documents USING hnsw (embedding vector_cosine_ops) 
    WITH (m = 16, ef_construction = 64);
""")
conn.commit()
print("Index created.")

python 02_create_index.py

HNSW parameter reference:

Use case                 m     ef_construction
Dev / testing            8     32
Production (standard)    16    64
High accuracy            32    128
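
The presets in the table can be turned into SQL programmatically. The sketch below is an illustration rather than part of the tutorial's files: the preset keys and the hnsw_index_sql helper are made-up names, and the values are simply those from the table. It only builds the statement string; executing it is left to the cursor from the script above.

```python
HNSW_PRESETS = {
    "dev": {"m": 8, "ef_construction": 32},
    "production": {"m": 16, "ef_construction": 64},
    "high_accuracy": {"m": 32, "ef_construction": 128},
}


def hnsw_index_sql(preset: str = "production") -> str:
    """Build the CREATE INDEX statement for one of the presets above."""
    params = HNSW_PRESETS[preset]  # raises KeyError for unknown preset names
    return (
        "CREATE INDEX IF NOT EXISTS docs_embedding_idx "
        "ON documents USING hnsw (embedding vector_cosine_ops) "
        f"WITH (m = {params['m']}, ef_construction = {params['ef_construction']});"
    )


print(hnsw_index_sql("dev"))
```

Because the statement uses IF NOT EXISTS with a fixed index name, switching presets has no effect on an index that already exists; it would need to be dropped first.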

Step 3: Ingest Documents — 03_ingest.py

import psycopg2
from google import genai
from google.genai import types
from dotenv import load_dotenv
import os

load_dotenv()
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
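# Same connection settings as 01_setup_db.py
conn = psycopg2.connect(
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)
cur = conn.cursor()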

def get_embedding(text: str) -> list[float]:
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_DOCUMENT", # use RETRIEVAL_DOCUMENT for storage
            output_dimensionality=768,
        ),
    )
    return result.embeddings[0].values

def insert_document(title: str, body: str, category: str) -> int:
    embedding = get_embedding(f"{title}\n\n{body}")
    cur.execute("""
        INSERT INTO documents (title, body, category, embedding) 
        VALUES (%s, %s, %s, %s) RETURNING id;
    """, (title, body, category, embedding))
    doc_id = cur.fetchone()[0]
    conn.commit()
    return doc_id

# ... (Sample data insertion loop)

python 03_ingest.py
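
The script above passes the embedding as a Python list and relies on psycopg2 and pgvector to convert it on insert. An alternative, sketched here on the assumption that you prefer to send pgvector's text format explicitly, is to build the '[x1,x2,...]' literal yourself and cast it with %s::vector in the SQL. to_vector_literal is a hypothetical helper name, not something the tutorial defines.

```python
import math


def to_vector_literal(values: list[float]) -> str:
    """Format floats as pgvector's text representation, e.g. '[0.1,0.2]'."""
    if not values:
        raise ValueError("embedding is empty")
    if not all(math.isfinite(v) for v in values):
        # pgvector rejects NaN and infinite components
        raise ValueError("embedding contains NaN or infinity")
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


print(to_vector_literal([0.25, -1.0, 3.0]))  # [0.25,-1.0,3.0]
```

Validating on the Python side surfaces a bad embedding before it reaches the database, at the cost of a little extra code.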

task_type matters: use RETRIEVAL_DOCUMENT when storing and RETRIEVAL_QUERY when searching. The model tailors the embedding to each role, and this asymmetric setup improves retrieval accuracy.
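
Since the document-side and query-side calls differ only in task_type, one way to keep the model and dimensionality from drifting apart is a single helper. This is a sketch, not the tutorial's code: embed is a hypothetical name, client is the genai.Client created in the script, and the config is passed as a plain dict, which the google-genai SDK is assumed to accept in place of types.EmbedContentConfig.

```python
EMBED_MODEL = "gemini-embedding-001"
EMBED_DIMS = 768  # must match the vector(768) column


def embed(client, text: str, *, for_query: bool = False) -> list[float]:
    """Embed `text` as a stored document (default) or as a search query."""
    task_type = "RETRIEVAL_QUERY" if for_query else "RETRIEVAL_DOCUMENT"
    result = client.models.embed_content(
        model=EMBED_MODEL,
        contents=text,
        # Assumption: a dict is accepted here instead of types.EmbedContentConfig
        config={"task_type": task_type, "output_dimensionality": EMBED_DIMS},
    )
    values = list(result.embeddings[0].values)
    if len(values) != EMBED_DIMS:
        raise ValueError(f"expected {EMBED_DIMS} dimensions, got {len(values)}")
    return values
```

Ingestion would call embed(client, text) and search would call embed(client, query, for_query=True), so a change to the model or dimension happens in one place.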


Step 4: Vector Search — 04_search.py

# ... (Setup client and connection)

def get_query_embedding(text: str) -> list[float]:
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
        config=types.EmbedContentConfig(
            task_type="RETRIEVAL_QUERY", # use RETRIEVAL_QUERY for search
            output_dimensionality=768,
        ),
    )
    return result.embeddings[0].values

def search(query: str, top_k: int = 3) -> list[dict]:
    query_embedding = get_query_embedding(query)
    cur.execute("""
        SELECT id, title, category, 1 - (embedding <=> %s::vector) AS similarity 
        FROM documents 
        ORDER BY embedding <=> %s::vector 
        LIMIT %s;
    """, (query_embedding, query_embedding, top_k))
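    rows = cur.fetchall()
    # Map each row to a dict keyed by the selected column names
    return [
        {"id": r[0], "title": r[1], "category": r[2], "similarity": r[3]}
        for r in rows
    ]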
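
The <=> operator in the query returns cosine distance, which is why the SELECT computes 1 - distance to report a similarity. For intuition, the same quantities can be computed in plain Python. This sketch is illustrative only, with function names chosen for the example, and is far slower than letting pgvector do the work through the index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: 1.0 = same direction, 0.0 = orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """The quantity the <=> operator orders by: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))    # 1.0
```

Ordering by distance ascending, as the query does, is therefore the same as ordering by similarity descending.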