How to Build a Powerful LLM Knowledge Base

In this article, I’ll discuss how to build a knowledge base powered by LLMs. A knowledge base is a place where you store a lot of information and make it accessible for future use. This is incredibly powerful for better decision-making, quickly picking up past context, and aligning your team.

Lately, I’ve been spending a lot of time setting up a knowledge base and routing as much context as possible into it, to help me with all of the points above. Knowledge bases were useful even before LLMs, because access to past knowledge is always valuable. However, LLMs have made them far more powerful, for two main reasons: you can capture more information in the knowledge base, and you can query it more easily (you don’t have to look through it manually).

In this article, I’ll cover why you should set up your own LLM-powered knowledge base built around coding agents, how to capture as much information as possible by routing it into the knowledge base, and how to actively use that information during inference.

I’ve discussed this topic before, but I’ve grown more and more fond of knowledge bases as they’ve become more popular. For example, the president of Y Combinator is building GBrain, and Andrej Karpathy is building an LLM wiki; both are examples of knowledge bases. There is, of course, no ground truth for the optimal way to build one. I think the most important thing is to actually start storing all of your context in a knowledge base, and to figure out how to query it effectively wherever you work, for example when writing code or in meetings.

Why you should have a knowledge base

First of all, I’d like to cover why you should have a knowledge base. You can have different kinds of knowledge bases. For example, you can have a personal one containing all the context that you have personally, or a company-wide one containing the knowledge and context that the company possesses. The reason you should have a knowledge base is that information is extremely valuable. The more information you can store and later access when needed, the better you will perform.

I also believe that knowledge bases have become far more powerful because you can query them with LLMs. Previously, you had to look through the knowledge base manually to find relevant information. You had to rely on your own memory to recall whether a certain piece of information was stored there, and then decide whether it was worth the time to find it. Now that is completely turned around. The LLM can query the knowledge base itself, for example with a RAG-type (retrieval-augmented generation) approach, and find relevant information automatically.
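To make the retrieval step concrete, here is a minimal sketch of what a RAG-type lookup does: score each stored note against the question, pick the best matches, and put them in front of the question before it goes to the LLM. This is only an illustration under simplifying assumptions. It uses plain word-overlap (bag-of-words cosine similarity) as a stand-in for the embedding search a real setup would more likely use, and the entries in `KNOWLEDGE_BASE` and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, documents: dict[str, str], top_k: int = 2) -> list[str]:
    """Return the ids of the top_k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine_similarity(query_vec, Counter(tokenize(text))), doc_id)
        for doc_id, text in documents.items()
    ]
    # Highest score first; drop documents with no word overlap at all.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]


def build_prompt(query: str, documents: dict[str, str], top_k: int = 2) -> str:
    """Prepend the retrieved notes to the question before it goes to an LLM."""
    hits = retrieve(query, documents, top_k)
    context = "\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in hits)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge-base entries, invented for illustration.
KNOWLEDGE_BASE = {
    "meeting-pricing": "We decided to move the pricing page to annual billing by default.",
    "ticket-login": "Login fails on Safari because the session cookie is blocked.",
    "meeting-hiring": "We agreed to open two backend engineering roles next quarter.",
}

if __name__ == "__main__":
    question = "What did we decide about pricing and billing?"
    print(build_prompt(question, KNOWLEDGE_BASE, top_k=1))
```

The resulting prompt is what you would hand to the model. The point is that the lookup happens without you having to remember that the pricing meeting ever took place.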

Capturing information into the knowledge base

The first step is, of course, to capture information into the knowledge base. Depending on how your knowledge base is built, this can happen in a variety of ways. However, the first thing I urge you to do is think through all the different sources of information you have access to, either personally or at the company. These are, for example, meetings, your project management tool (such as Linear), your coding agent (such as Claude Code or Codex), and physical office discussions.

The point is that you should map out all of these information sources and figure out an automatic way to route information from them into your knowledge base. Neither you nor anyone else will be willing to spend extra time putting things into a knowledge base by hand, so the routing from source to knowledge base has to be fully automated if the knowledge base is going to stay up to date.

If you require a manual step (for example, pasting meeting notes into the knowledge base), you’ll eventually forget it and lose important context, which goes against the entire concept of the knowledge base. The whole point is that you store absolutely all information there and don’t leave anything out. That’s what makes a knowledge base so powerful. For example, with meeting notes, you can have a cron job that syncs daily. It takes every note from your own meetings, or from everyone’s meetings across the company, and stores it in the knowledge base.
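As a rough sketch of what such a daily sync could look like, the script below copies new or changed notes from an export folder into a knowledge-base folder. It assumes your meeting tool can export notes as Markdown files into a local directory, which may not match your setup. The function names, the file layout, and the paths in the cron line are all hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

# A crontab entry along these lines would run the sync every day at 06:00
# (the script path is a placeholder, not a real location):
#   0 6 * * * /usr/bin/python3 /path/to/sync_notes.py


def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents, used to detect new or changed notes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def sync_notes(source_dir: Path, kb_dir: Path) -> list[str]:
    """Copy new or changed .md notes from source_dir into kb_dir.

    Returns the names of the files that were copied on this run.
    """
    kb_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for note in sorted(source_dir.glob("*.md")):
        target = kb_dir / note.name
        # Skip notes already in the knowledge base with identical content.
        if target.exists() and file_digest(target) == file_digest(note):
            continue
        shutil.copy2(note, target)
        copied.append(note.name)
    return copied


if __name__ == "__main__":
    # Demo with temporary directories standing in for the real locations.
    with tempfile.TemporaryDirectory() as tmp:
        source, kb = Path(tmp) / "exports", Path(tmp) / "kb"
        source.mkdir()
        (source / "2024-01-15-standup.md").write_text("Decided to ship on Friday.")
        print(sync_notes(source, kb))  # first run copies the note
        print(sync_notes(source, kb))  # second run finds nothing new
```

Because the job compares file contents before copying, it is safe to run it as often as you like: a run with nothing new does nothing, and a note that was edited after the meeting gets picked up on the next run.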