Building a RAG Pipeline From Scratch: What SmartQueue Taught Me About Retrieval

When I set out to add an AI assistant to SmartQueue, a distributed task queue I’d already built in Go for handling IT support tickets, the obvious move was to bolt on an LLM and call it done. Type a question, get an answer. But a generic LLM doesn’t know your company’s password reset procedure, your P1 outage runbook, or that refunds need manager approval above $500. The assistant needed grounding in actual internal knowledge. That’s the job retrieval-augmented generation (RAG) is built for: pull the relevant facts out of your own documents first, then hand them to the model as context instead of trusting it to know your business.

This post walks through how that pipeline actually works, the architectural decision I reversed midway through (and why), the numbers I picked for things like retrieval depth and temperature, and an honest take on whether any of it counts as “real” RAG.

What the assistant actually does

SmartQueue Bot lives inside the Queue Health and AI Bot tabs of the dashboard. An agent picks a ticket, asks a question like “what are the immediate steps for this database outage,” and the bot streams back an answer token by token, grounded in a small internal knowledge base of IT runbooks. The request flow looks like this:

  • agent question

  • prompt-injection check (regex guardrails)

  • BM25 search over 10 runbooks -> top 4 matches

  • system prompt assembled: ticket context + runbook excerpts

  • Groq (LLaMA 3.3 70B) streamed via SSE, with last 10 turns of session history

  • response streamed to client + written back to Redis session memory

Three things happen before any text reaches the model: the user’s message is checked for prompt injection attempts, the message is used as a query against the knowledge base, and the top matches get woven into a system prompt alongside the ticket’s category, priority, and description. The model never sees raw documents without that framing. It sees a structured brief.
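
The guardrail and prompt-assembly steps described above could look roughly like the following. This is a minimal sketch, not SmartQueue's actual code: the regex patterns, the `Ticket` fields, the prompt wording, and the function names `looks_like_injection` and `build_system_prompt` are all assumptions made for illustration, and retrieval is represented by a list of excerpts that have already been selected.

```python
import re
from dataclasses import dataclass

# Hypothetical guardrail patterns (assumed for this sketch, not from the post).
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"reveal (your |the )?system prompt", re.IGNORECASE),
    re.compile(r"you are now\b", re.IGNORECASE),
]


@dataclass
class Ticket:
    # Field names mirror the ticket context mentioned in the text.
    category: str
    priority: str
    description: str


def looks_like_injection(message: str) -> bool:
    """Return True if any guardrail pattern matches the user's message."""
    return any(pattern.search(message) for pattern in INJECTION_PATTERNS)


def build_system_prompt(ticket: Ticket, excerpts: list[str]) -> str:
    """Combine ticket context and retrieved runbook excerpts into one brief."""
    runbook_block = "\n\n".join(
        f"[Runbook {i}]\n{text}" for i, text in enumerate(excerpts, start=1)
    )
    return (
        "You are an IT support assistant. "
        "Answer using only the runbook excerpts below.\n\n"
        f"Ticket category: {ticket.category}\n"
        f"Ticket priority: {ticket.priority}\n"
        f"Ticket description: {ticket.description}\n\n"
        f"{runbook_block}"
    )
```

Regex guardrails of this kind only catch known phrasings, so they are best read as a cheap first filter rather than a complete defense.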

The decision I reversed: ChromaDB, then BM25

The first version of the knowledge base used ChromaDB with its default ONNX embedding function: proper vector search, no torch dependency, queried through a thread pool so it wouldn’t block the event loop. That’s the textbook RAG setup, and it worked locally. It fell apart the moment I tried to deploy the whole stack as a single container on Hugging Face Spaces.

The deployment used supervisord to run Redis, the Go API, two Go worker replicas, and the FastAPI AI service inside one container, originally with a separate ChromaDB process alongside them. That’s five long-running processes, six counting ChromaDB, competing for a small amount of memory and CPU in a free-tier container, with supervisord responsible for starting them in the right order and keeping them alive. ChromaDB was the one that kept causing startup races and silent failures.

After enough commits with messages like “fix: remove ChromaDB from supervisord” and “fix: replace ChromaDB with in-memory BM25 search,” I made the call to rip it out entirely. The replacement is about 50 lines of pure Python, with no embedding model, no external process, and no network call.

(Code snippet omitted for brevity)

This is the standard Okapi BM25 formula, computed fresh against the in-memory runbook corpus on every query. No index to build, no daemon to keep alive, no embedding latency on cold start. The trade-off is real: BM25 only matches on term overlap, so a query phrased very differently from the runbook’s wording (synonyms, paraphrasing) won’t score well. But for a fixed set of 10 short, keyword-dense IT runbooks where users are typically searching with the same vocabulary the runbooks use (“VPN,” “password reset,” “outage”), that weakness barely shows up in practice. The thing that mattered more than retrieval quality at this scale was that the service now starts reliably every single time.
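
For readers who want something concrete, here is a minimal sketch of an in-memory Okapi BM25 scorer along the lines described. It is not the omitted SmartQueue snippet: the tokenizer, the function names, the dict-of-strings corpus shape, and the `+ 1` inside the IDF logarithm (a common variant that keeps scores non-negative) are assumptions. The `k1 = 1.5`, `b = 0.75`, and top-4 values come from the post.

```python
import math
import re
from collections import Counter

K1 = 1.5   # term-frequency saturation
B = 0.75   # document-length normalization


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_search(query: str, docs: dict[str, str], top_k: int = 4) -> list[tuple[str, float]]:
    """Score every document against the query and return the top_k matches."""
    tokenized = {name: tokenize(body) for name, body in docs.items()}
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scored = []
    for name, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            length_norm = K1 * (1 - B + B * len(tokens) / avg_len)
            score += idf * tf * (K1 + 1) / (tf + length_norm)
        if score > 0:
            scored.append((name, score))

    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Everything is recomputed per query, which is wasteful for a large corpus but negligible for ten short documents. Note how the term-overlap limitation shows up directly: a query containing "connecting" will not match a runbook that only says "connection".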

The numbers, and why those numbers

A few of the constants in this pipeline were deliberate tuning decisions rather than defaults I left untouched. None of this is a RAGAS-style evaluation with precision/recall/faithfulness scores. There’s no eval harness here, just systems-level tuning based on the constraints I was working under (a free-tier LLM provider, a single demo container, and a knowledge base that doesn’t change).

| Constant | Value | Why |
| --- | --- | --- |
| Retrieved docs (k) | 4 | Enough context to cover the answer without bloating the prompt |
| BM25 k1 / b | 1.5 / 0.75 | Standard Robertson defaults |
| Bot temperature | 0.2 | Troubleshooting answers should be literal and repeatable |
| Classifier temperature | 0.1 | Output is parsed as JSON; near-deterministic |
| Recommender temperature | 0.3 | Slightly more room for reasoning over queue state |
| Bot max_tokens | 800 | Long enough for guidance, short enough for snappy streaming |
| Classifier max_tokens | 250 | Schema is small, just eight short fields |
| Session history window | 10 turns | Enough continuity without memory growing unbounded |
| Rate limit | 30 req/min | Protects free Groq quota |
| LLM client retries | 0 | Fallbacks already exist; retrying just adds latency |

That last one is worth dwelling on. Every AI-backed endpoint in this system has a non-LLM fallback.
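
The zero-retry choice is easiest to see as a pattern: make one attempt at the LLM, and on any failure drop straight to a deterministic path. The sketch below illustrates that idea for a ticket classifier. The keyword table, the result shape, and the function names are hypothetical, and the LLM call is passed in as a plain callable rather than a real Groq client.

```python
from typing import Callable

# Hypothetical keyword-to-category rules for the non-LLM path.
KEYWORD_CATEGORIES = {
    "vpn": "network",
    "password": "access",
    "outage": "incident",
}


def keyword_classify(text: str) -> dict:
    """Deterministic fallback: first matching keyword wins."""
    lowered = text.lower()
    for keyword, category in KEYWORD_CATEGORIES.items():
        if keyword in lowered:
            return {"category": category, "source": "fallback"}
    return {"category": "general", "source": "fallback"}


def classify_with_fallback(text: str, llm_classify: Callable[[str], dict]) -> dict:
    """Try the LLM exactly once; on any error, use the keyword fallback."""
    try:
        return {**llm_classify(text), "source": "llm"}
    except Exception:
        # Broad catch is deliberate in this sketch: timeouts, quota errors,
        # and malformed output all take the same fallback path, with no retry.
        return keyword_classify(text)
```

Tagging each result with its `source` makes it possible to see afterwards how often the fallback path was actually taken.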

最后一点值得深思。该系统中的每个 AI 支持的端点都有一个非 LLM 的回退机制。