Vector RAG Isn’t Enough — I Built a Context Graph Layer for Multi-Agent Memory

A structured context graph beat a flat chat dump and a vector-only RAG pipeline on the same multi-agent conversations. Full working code, real benchmark numbers, zero API calls.

TL;DR: I wasn’t trying to build a new memory architecture. I was trying to understand why one agent kept forgetting decisions made by another. The benchmark came later. Multi-agent systems lose cross-agent decisions because flat transcripts and vector search both have a structural blind spot, not just a noise problem. A context graph stores facts as entities and relationships instead of text chunks, so it can answer questions that need two facts combined.

This is not just a concept. I ran three memory architectures against five scripted scenarios and 18 graded queries, fully deterministic, with zero LLM calls:

  • Context graph: 88.9% accuracy at 26.9 tokens/query.
  • Raw history dump: 61.1% accuracy at 490.9 tokens/query.
  • Vector-only RAG: 50.0% accuracy at 75.9 tokens/query.
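
The percentages above line up with whole-query counts out of 18. Here is a quick sanity check, assuming each query is graded pass/fail and weighted equally (that grading scheme is an assumption on my part for this sketch, and the counts are inferred from the percentages rather than read from benchmark output):

```python
# Sanity-check the headline numbers.
# Assumption: all 18 queries are graded pass/fail with equal weight, so each
# accuracy figure should correspond to a whole number of correct answers.
TOTAL_QUERIES = 18

reported = {
    "context_graph": {"accuracy_pct": 88.9, "tokens_per_query": 26.9},
    "raw_history": {"accuracy_pct": 61.1, "tokens_per_query": 490.9},
    "vector_rag": {"accuracy_pct": 50.0, "tokens_per_query": 75.9},
}


def implied_correct(accuracy_pct: float, total: int = TOTAL_QUERIES) -> int:
    """Return the whole number of correct answers closest to the percentage."""
    return round(accuracy_pct / 100 * total)


def token_ratio(name: str, baseline: str = "context_graph") -> float:
    """Tokens per query for `name`, relative to the baseline architecture."""
    return reported[name]["tokens_per_query"] / reported[baseline]["tokens_per_query"]


for name, row in reported.items():
    correct = implied_correct(row["accuracy_pct"])
    print(f"{name}: {correct}/{TOTAL_QUERIES} correct, "
          f"{token_ratio(name):.1f}x the graph's tokens per query")
```

Under that assumption the three architectures answered 16, 11, and 9 of the 18 queries, and the raw history dump costs roughly 18 times as many tokens per query as the graph.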

I found two real bugs while building this: stale-fact retrieval and an entity-matching gap. Both are covered in the article.

The Problem That Made Me Build This

I built a three-agent pipeline that worked great for short tasks. But the moment a conversation dragged on and an agent needed to recall a past decision, the whole thing fell apart. Here is exactly how it broke: Agent_Planner would decide the project should use PostgreSQL. Then twenty turns of “sounds good” and “I’ll get to it” would pass. Eventually, Agent_Reviewer would pipe up and ask what storage technology we were using. Even with the entire raw transcript sitting right there in the context window, the agent couldn’t answer reliably.

I was running this pipeline locally as a side project for EmiTechLogic, just to see how far I could push multi-agent coordination before it hit a wall. It didn’t take long. Initially, I assumed this was just a model limitation. It isn’t. It is a memory architecture problem, and it usually triggers one of two massive headaches depending on how you try to fix it.

The Alternative Fix: Vector Search and the Relational Trap

If you switch to vector search, you fix the noise problem of a raw transcript but immediately create a different one. A vector store retrieves chunks that look similar to your query; it doesn’t retrieve relationships between facts. If a key decision lives in one chunk and a critical dependency note about that decision lives in another, a similarity search has no way to combine them, no matter how good your embedding model is. The two approaches hit different structural ceilings. Instead of guessing which compromise was “good enough,” I decided to measure them both.
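
To make that blind spot concrete, here is a minimal sketch. It uses bag-of-words cosine similarity as a stand-in for a real embedding model, and the transcript chunks and the query are made up for illustration. Real embeddings handle paraphrase much better, but they still score each chunk against the query on its own:

```python
import math
import re
from collections import Counter


# Toy stand-in for an embedding model (an assumption for this sketch):
# bag-of-words vectors compared with cosine similarity.
def embed(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical transcript, one chunk per turn.
chunks = [
    "Agent_Planner: AuthModule depends on RateLimiter",
    "Agent_Reviewer: sounds good",
    "Agent_Planner: Agent_Implementer is assigned to AuthModule",
    "Agent_Reviewer: I will get to the RateLimiter review later",
]


def top_k(query: str, k: int = 2) -> list[str]:
    """Rank every chunk independently by similarity to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Answering this needs two facts: the dependency and the assignment.
query = "Which agent owns the component that depends on RateLimiter?"
retrieved = top_k(query)
for chunk in retrieved:
    print(chunk)
```

The dependency chunk ranks first, but the assignment chunk never makes the top two. Its only link to the question is AuthModule, which the question never mentions, so a filler turn that happens to contain “RateLimiter” outranks it.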

What This Problem Actually Is

To be clear about what this article is not: this isn’t a token-compression problem, and it’s not a staleness problem. It’s a structural retrieval problem. Some questions can only be answered by combining two separately stated facts, and neither a growing context window nor a vector index has a mechanism to do that. That is a completely different failure mode from the ones I’ve written about before, and it needed a different benchmark.

The Test Setup

To test this, I built five deterministic scenarios containing 18 graded queries and ran all three memory architectures against the exact same conversations. All the results below come from real runs of that benchmark on a local setup:

  • Environment: Python 3.12, CPU-only (no GPU needed)
  • API Calls: Zero
  • Consistency: Reproduced identically across two separate machines
  • Code Repo: You can find the complete implementation and run the tests yourself here: https://github.com/Emmimal/context-graph-benchmark/

What “Context Graph” Means Here

A flat memory store (whether it is a raw chat transcript or a vector index) treats every single turn as an independent unit of text. To retrieve something, you just find the unit that best matches your query. A context graph changes the underlying structure entirely. It treats memory as distinct entities connected by typed relationships:

  • AuthModule --> DEPENDS_ON --> RateLimiter
  • Agent_Implementer --> ASSIGNED_TO --> AuthModule

Retrieval in this model means traversing these relationships instead of just matching keywords or semantic vectors. That structural difference only matters for one specific class of questions: anything that requires you to combine two separately stated facts. Consider a question like: “Which team owns the component that depends on the service that X chose?” There is no single answer chunk sitting anywhere in the raw conversation history. The answer does not exist as a block of text. It only exists as a path through multiple facts. A flat store cannot construct that path on the fly. A graph walks right through it.
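
As a rough illustration of what “walking the path” means, here is a minimal triple store built around the two relationships listed earlier. The `ContextGraph` class and the `owners_of_dependents` helper are hypothetical names for this sketch, not the implementation in the repository:

```python
from collections import defaultdict


class ContextGraph:
    """Minimal triple store: (subject, relation, object) edges, indexed both ways."""

    def __init__(self) -> None:
        self._out: dict[tuple[str, str], set[str]] = defaultdict(set)
        self._in: dict[tuple[str, str], set[str]] = defaultdict(set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        self._out[(subject, relation)].add(obj)
        self._in[(relation, obj)].add(subject)

    def objects(self, subject: str, relation: str) -> set[str]:
        """Follow an edge forwards: subject --relation--> ?"""
        return set(self._out.get((subject, relation), ()))

    def subjects(self, relation: str, obj: str) -> set[str]:
        """Follow an edge backwards: ? --relation--> obj"""
        return set(self._in.get((relation, obj), ()))


graph = ContextGraph()
graph.add("AuthModule", "DEPENDS_ON", "RateLimiter")
graph.add("Agent_Implementer", "ASSIGNED_TO", "AuthModule")


def owners_of_dependents(graph: ContextGraph, service: str) -> set[str]:
    """Two-hop query: who is assigned to any component that depends on `service`?"""
    owners: set[str] = set()
    for component in graph.subjects("DEPENDS_ON", service):  # hop 1
        owners |= graph.subjects("ASSIGNED_TO", component)   # hop 2
    return owners


print(owners_of_dependents(graph, "RateLimiter"))
```

The query never mentions AuthModule. The first hop finds it as the bridging entity, and the second hop reaches Agent_Implementer, which is the join a similarity search has no way to perform.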

Who This Is For

This approach is worth building if you run multi-agent pipelines where one agent’s decision must be correctly retrieved by a different agent many turns later. It is built for systems where questions routinely require combining two or more separately stated facts, and for any long-running agent conversation where the token cost of re-sending history is becoming a real line item.

You should skip it for single-agent, single-turn tasks, because there is no cross-agent state to lose. Skip it if your queries are always single-fact lookups with no joins; vector RAG gets you most of the accuracy there at a fraction of the engineering cost. Finally, skip it if your team has no tolerance for an extra moving part. A graph needs an extraction step (rule-based in this benchmark, but requiring an LLM call in production) that a flat store avoids. If your multi-agent system finishes its work in a single exchange, plain context passing works fine. This problem shows up specifically when conversations run long and decisions need to survive past the turn they were made in.
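
To give a feel for what a rule-based extraction step can look like, here is a minimal sketch using regular expressions. The patterns and the `extract_triples` function are hypothetical; the benchmark’s actual rules live in the repository and may look quite different:

```python
import re

# Hypothetical extraction rules: each pattern maps one sentence shape to a
# relation type. These are illustrative, not the benchmark's real rules.
RULES = [
    (re.compile(r"(?P<subj>\w+) depends on (?P<obj>\w+)", re.IGNORECASE), "DEPENDS_ON"),
    (re.compile(r"(?P<subj>\w+) is assigned to (?P<obj>\w+)", re.IGNORECASE), "ASSIGNED_TO"),
]


def extract_triples(text: str) -> list[tuple[str, str, str]]:
    """Return (subject, relation, object) triples for every rule that matches."""
    triples = []
    for pattern, relation in RULES:
        for match in pattern.finditer(text):
            triples.append((match["subj"], relation, match["obj"]))
    return triples


# Made-up turns: two carry facts, one is filler and yields nothing.
turns = [
    "AuthModule depends on RateLimiter.",
    "Sounds good.",
    "Agent_Implementer is assigned to AuthModule.",
]
triples = [triple for turn in turns for triple in extract_triples(turn)]
for triple in triples:
    print(triple)
```

Rules like these are cheap and deterministic, but they miss any phrasing they were not written for (“relies on”, “is owned by”). That gap is why the extraction step needs an LLM call in production.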