I Built a Private AI Assistant That Queries My Git History and Project Management Data — Using Only Local LLMs
No API keys. No cloud. All data stays on my machine.
The Problem
As a web developer, I constantly need to answer questions like: “Who committed the most to our main repo this month?” “What files were changed for the last campaign launch?” “What project tasks are still in progress for the web team?” The answers exist, but they are scattered across git log output, project management boards, and my own memory. I was tired of digging through terminal output and clicking through boards manually, so I built a natural language interface that lets me ask these questions in plain English and get answers right away.
The Architecture: Text-to-SQL, Not Vector RAG
Here’s the key insight that shaped the entire project: my data is structured, not unstructured. Commits have authors, dates, and repos. Project tasks have statuses, deadlines, and assignees. This isn’t a pile of PDFs; it’s relational data that fits naturally into a SQLite database. Traditional RAG (vector embeddings + similarity search) is built for unstructured documents. For structured data, there’s a better approach: Text-to-SQL.
User Question
    ↓
Local LLM (generates SQL)
    ↓
SQLite Database (executes query)
    ↓
Local LLM (summarizes results)
    ↓
Human-readable Answer
The LLM doesn’t store or memorize my data. It translates my question into SQL, the system runs that SQL against SQLite, and the LLM then summarizes the rows that come back.
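The flow can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual main.py: the `answer` function, the prompt wording, the `commits` table layout, and the stand-in `fake_llm` are all hypothetical, and a real run would pass a callable that talks to the local model.

```python
import sqlite3
from typing import Callable


def answer(question: str, conn: sqlite3.Connection, llm: Callable[[str], str]) -> str:
    """Question -> SQL -> rows -> summary. `llm` is any prompt-in, text-out callable."""
    # 1. Ask the model for a single SQL query.
    sql = llm(f"Write one SQLite SELECT statement that answers: {question}").strip()
    # Guard (an added assumption): only read-only statements reach the database.
    if not sql.lower().startswith("select"):
        raise ValueError(f"refusing to run non-SELECT statement: {sql!r}")
    # 2. Execute the query locally.
    rows = conn.execute(sql).fetchall()
    # 3. Ask the model to turn the raw rows into a readable answer.
    return llm(f"Question: {question}\nSQL: {sql}\nRows: {rows}\nAnswer briefly.")


def fake_llm(prompt: str) -> str:
    """Stand-in for the local model so the sketch runs without Ollama."""
    if prompt.startswith("Write one SQLite"):
        return ("SELECT author, COUNT(*) FROM commits "
                "GROUP BY author ORDER BY 2 DESC LIMIT 1")
    return "(summary) " + prompt.splitlines()[2]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (author TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?)",
    [("alice", "a"), ("alice", "b"), ("bob", "c")],
)
print(answer("Who committed the most?", conn, fake_llm))
```

Keeping the model behind a plain callable makes the pipeline easy to test without a running LLM.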
The Data Pipeline
Step 1: Collect everything into SQLite
I wrote two Python collectors that populate a single SQLite database:
- Git history collector (collect.py): Runs git log across multiple repositories, stores commits, file changes, branches, and tags, and captures author, date, message, and insertions/deletions per file.
- Project management collector (collect_pm.py): Queries the project management platform’s GraphQL API (Monday.com in my case, but the pattern works for Jira, Linear, etc.), stores boards, items, and subitems, extracts status, assignee, department, and deadline, and flags web-team tasks automatically (is_web = 1).
The result: a single SQLite database holding everything needed to answer cross-cutting questions.
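A minimal sketch of what the git side of such a collector might look like. It is not the real collect.py: the table names, columns, the "@@" header marker, and the function names are assumptions chosen for illustration. The git options used (`--numstat` and `--pretty=format:` with `%H`, `%an`, `%aI`, `%s`, `%x09`) are standard.

```python
import sqlite3
import subprocess

# "@@" marks a commit header line; %x09 is a tab between fields.
GIT_FORMAT = "@@%H%x09%an%x09%aI%x09%s"


def read_git_log(repo_path: str) -> str:
    """Run git log with per-file insertion/deletion counts for one repository."""
    return subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", f"--pretty=format:{GIT_FORMAT}"],
        capture_output=True, text=True, check=True,
    ).stdout


def store_log(conn: sqlite3.Connection, repo: str, log_text: str) -> None:
    """Parse the log text and insert commits and file changes (hypothetical schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS commits "
                 "(hash TEXT PRIMARY KEY, repo TEXT, author TEXT, date TEXT, message TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS file_changes "
                 "(hash TEXT, path TEXT, insertions INTEGER, deletions INTEGER)")
    current = None
    for line in log_text.splitlines():
        if line.startswith("@@"):
            current, author, date, message = line[2:].split("\t", 3)
            conn.execute("INSERT OR IGNORE INTO commits VALUES (?, ?, ?, ?, ?)",
                         (current, repo, author, date, message))
        elif line.strip() and current:
            added, deleted, path = line.split("\t", 2)
            # Binary files report "-" instead of line counts; store them as 0.
            conn.execute("INSERT INTO file_changes VALUES (?, ?, ?, ?)",
                         (current, path,
                          int(added) if added.isdigit() else 0,
                          int(deleted) if deleted.isdigit() else 0))
    conn.commit()


# Sample text in the same shape as the git output, so the sketch runs without a repo.
sample = (
    "@@abc123\tDeveloper A\t2024-05-01T10:00:00+00:00\tAdd promo banner\n"
    "\n"
    "12\t3\tsrc/banner.js\n"
    "-\t-\tassets/banner.png\n"
)
conn = sqlite3.connect(":memory:")
store_log(conn, "main-site", sample)
print(conn.execute("SELECT path, insertions, deletions FROM file_changes").fetchall())
```

For a real repository, `store_log(conn, "main-site", read_git_log("/path/to/repo"))` would feed actual history through the same parser.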
Step 2: Link git branches to project tasks
This was the crucial step. Git branches like feature/example-promo-banner don’t obviously connect to project items like “Example Promo Banner — Launch”. I created a branch_task_map table that links them. This lets the system cross-reference: “What tasks relate to this branch?” or “What commits were made for this launch?”
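To make the linking concrete, here is a small sketch of how a branch_task_map table can join commits to project items. Only the branch_task_map name and the is_web flag come from the description above; the other tables, all columns, and the `commits_for_task` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pm_items (id INTEGER PRIMARY KEY, name TEXT, status TEXT, is_web INTEGER);
CREATE TABLE commits (hash TEXT PRIMARY KEY, branch TEXT, author TEXT, message TEXT);
-- The link table: one row per (branch, project item) pair.
CREATE TABLE branch_task_map (branch TEXT, item_id INTEGER REFERENCES pm_items(id));

INSERT INTO pm_items VALUES (1, 'Example Promo Banner - Launch', 'In Progress', 1);
INSERT INTO pm_items VALUES (2, 'Checkout Redesign', 'Done', 1);
INSERT INTO commits VALUES ('a1', 'feature/example-promo-banner', 'Developer A', 'Add banner markup');
INSERT INTO commits VALUES ('b2', 'feature/checkout', 'Developer B', 'Refactor cart');
INSERT INTO branch_task_map VALUES ('feature/example-promo-banner', 1);
INSERT INTO branch_task_map VALUES ('feature/checkout', 2);
""")


def commits_for_task(conn: sqlite3.Connection, task_keyword: str) -> list:
    """Answer 'What commits were made for this launch?' through the link table."""
    return conn.execute(
        """
        SELECT c.hash, c.author, c.message
        FROM commits AS c
        JOIN branch_task_map AS m ON m.branch = c.branch
        JOIN pm_items AS i ON i.id = m.item_id
        WHERE i.name LIKE ?
        ORDER BY c.hash
        """,
        (f"%{task_keyword}%",),
    ).fetchall()


print(commits_for_task(conn, "Promo Banner"))
```

The same join read in the other direction answers “What tasks relate to this branch?”.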
The RAG System
Why Ollama? Privacy was non-negotiable. Project data, commit messages, and task details shouldn’t leave the machine. Ollama runs the LLM entirely locally — no internet needed, no data sent anywhere. I chose qwen2.5-coder:7b as the model — it’s excellent at SQL generation and runs fast on Apple Silicon.
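For reference, a local call to Ollama needs nothing beyond the standard library. The sketch below assumes Ollama is listening on its default port 11434 and uses its /api/generate endpoint with streaming turned off; the helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint
MODEL = "qwen2.5-coder:7b"


def build_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build the POST request; "stream": False asks for one JSON object back."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt to the local model and return its text reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

With the model already pulled and the Ollama server running, `generate("Write a SQLite query that counts rows in commits")` returns the model's reply as a string. The request only ever goes to localhost.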
The smart prompt: The system prompt includes the full database schema, sample values, few-shot SQL examples, and today’s date. The date matters because relative phrases such as “this month” can only be turned into a date filter if the model knows what today is.
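One way to assemble such a prompt is to read the schema straight out of SQLite instead of maintaining it by hand. The sketch below is an assumption about how this could be done, not the author's prompt: the function name, wording, and sample table are hypothetical.

```python
import sqlite3
from datetime import date


def build_system_prompt(conn, examples, today=None, samples_per_table=3):
    """Combine schema, sample rows, few-shot examples, and today's date."""
    today = today or date.today()
    parts = [
        "You translate questions into a single SQLite SELECT statement.",
        f"Today's date is {today.isoformat()}.",
        "",
        "Schema:",
    ]
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for name, ddl in tables:
        parts.append(ddl)  # the CREATE TABLE statement as stored by SQLite
        rows = conn.execute(
            f'SELECT * FROM "{name}" LIMIT ?', (samples_per_table,)
        ).fetchall()
        parts.append(f"-- sample rows from {name}: {rows}")
    parts += ["", "Examples:"]
    for question, sql in examples:
        parts.append(f"Q: {question}\nSQL: {sql}")
    return "\n".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (hash TEXT, author TEXT, date TEXT)")
conn.execute("INSERT INTO commits VALUES ('a1', 'Developer A', '2024-05-01')")
prompt = build_system_prompt(
    conn,
    examples=[("How many commits are there?", "SELECT COUNT(*) FROM commits")],
    today=date(2024, 5, 15),
)
print(prompt)
```

Because the schema is read at call time, the prompt stays in sync when a collector adds a table or column.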
Auto-discovery, the secret sauce: Before the LLM even sees the question, the system extracts keywords from it and searches for them across all tables. This means when you ask “What’s happening with the example promo banner launch?”, the system has already found the matching project board, related branches, and recent commits. The LLM gets these exact values, so it writes precise SQL instead of guessing.
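A rough sketch of what such a discovery pass could look like, assuming a simple stopword filter and a LIKE search over every TEXT column. The stopword list, function names, and sample tables are hypothetical, and the real keyword extraction may well differ.

```python
import re
import sqlite3

STOPWORDS = {"what", "whats", "who", "how", "the", "with", "for", "are",
             "and", "this", "that", "happening"}


def extract_keywords(question: str) -> list:
    """Lowercase, drop apostrophes, keep words longer than two letters."""
    text = question.lower().replace("'", "").replace("\u2019", "")
    return [w for w in re.findall(r"[a-z0-9]+", text)
            if len(w) > 2 and w not in STOPWORDS]


def discover(conn: sqlite3.Connection, question: str, limit: int = 5) -> dict:
    """Return {(table, column): [values]} for TEXT columns containing a keyword."""
    hits = {}
    keywords = extract_keywords(question)
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'").fetchall()]
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        for _, col, col_type, *_ in conn.execute(
                f'PRAGMA table_info("{table}")').fetchall():
            if "TEXT" not in (col_type or "").upper():
                continue
            for kw in keywords:
                rows = conn.execute(
                    f'SELECT DISTINCT "{col}" FROM "{table}" '
                    f'WHERE "{col}" LIKE ? LIMIT ?',
                    (f"%{kw}%", limit)).fetchall()
                for (value,) in rows:
                    found = hits.setdefault((table, col), [])
                    if value not in found:
                        found.append(value)
    return hits


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pm_items (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE commits (hash TEXT, branch TEXT, message TEXT);
INSERT INTO pm_items VALUES (1, 'Example Promo Banner - Launch');
INSERT INTO pm_items VALUES (2, 'Checkout Redesign');
INSERT INTO commits VALUES ('a1', 'feature/example-promo-banner', 'Add banner markup');
""")
hits = discover(conn, "What's happening with the example promo banner launch?")
for key, values in hits.items():
    print(key, values)
```

The resulting values can be appended to the prompt so the generated SQL filters on strings that actually exist in the database.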
Self-correcting queries: If a SQL query returns 0 results, the system automatically retries with different keyword strategies. This matters because commits often sit on a parent branch rather than on the feature branch itself, so a query filtered on the feature branch name can come back empty.
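The retry idea can be illustrated with a simplified loop. Note the simplification: instead of asking the LLM for new SQL on each attempt, this sketch uses a fixed query template and one assumed fallback order (all keywords together, then each keyword alone). The function names and the `commits` columns are hypothetical.

```python
import sqlite3


def keyword_strategies(keywords):
    """Yield progressively looser keyword sets: all together, then one at a time."""
    if keywords:
        yield list(keywords)
    if len(keywords) > 1:
        for kw in keywords:
            yield [kw]


def search_commits(conn: sqlite3.Connection, keywords):
    """Return (rows, keywords_used) from the first strategy that matches anything."""
    for attempt in keyword_strategies(keywords):
        # Each keyword may match the branch name or the commit message.
        where = " AND ".join("(branch LIKE ? OR message LIKE ?)" for _ in attempt)
        params = [p for kw in attempt for p in (f"%{kw}%", f"%{kw}%")]
        rows = conn.execute(
            f"SELECT hash, branch, message FROM commits WHERE {where}", params
        ).fetchall()
        if rows:
            return rows, attempt
    return [], []


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (hash TEXT, branch TEXT, message TEXT)")
conn.executemany("INSERT INTO commits VALUES (?, ?, ?)", [
    # The work landed on a parent branch, so the feature branch name never appears.
    ("a1", "develop", "Add promo banner markup"),
    ("b2", "develop", "Fix checkout bug"),
])
rows, used = search_commits(conn, ["promo", "banner", "launch"])
print(rows, used)
```

Here the strict first attempt finds nothing because no commit mentions "launch", and the looser second attempt recovers the relevant commit from the parent branch.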
The Result
A CLI tool where I type questions and get answers:
$ python3 main.py "who committed the most this month?"
Developer A and Developer B lead this month with roughly 350 commits each...
$ python3 main.py "what web tasks are pending for the next launch?"
The upcoming launch has 8 web tasks remaining...
Key Takeaways
- Not all RAG needs vectors. If your data is structured, Text-to-SQL is simpler and more accurate than embedding everything into a vector store.
- Local LLMs are production-ready. Ollama + qwen2.5-coder:7b runs fast on a MacBook and generates correct SQL reliably.
- Auto-discovery beats prompt engineering. Instead of hoping the LLM guesses the right table, provide the context explicitly.