Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History
Abstract: Long-term memory is the missing layer for LLM agents: they forget across sessions, and the common workaround, replaying the whole history into the prompt, is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy. Benchmark numbers are also reported on inconsistent, non-reproducible harnesses, so the same system appears with widely different scores across sources.
We present Engram, an open-source, dual-process memory engine built on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact. Contradicted facts are invalidated, never deleted, so every fact keeps its provenance and a supersession chain.
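The "invalidate, never delete" rule can be illustrated with a small in-memory store. This is a minimal sketch, not Engram's implementation: the `Fact` and `FactStore` names, the field names, and the rule that two facts contradict when they share a subject and predicate but differ in object are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: datetime      # valid time: when the fact became true in the world
    recorded_at: datetime     # transaction time: when the store learned it
    episode_id: str           # provenance: the episode the fact was extracted from
    invalidated_at: Optional[datetime] = None  # set on supersession; row is kept
    superseded_by: Optional[int] = None        # index of the replacing fact


class FactStore:
    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def assert_fact(self, fact: Fact) -> int:
        """Append a fact and invalidate live facts it contradicts.

        Assumption: predicates are single-valued, so a different object for
        the same (subject, predicate) counts as a contradiction.
        """
        new_id = len(self.facts)
        for old in self.facts:
            same_slot = (old.subject, old.predicate) == (fact.subject, fact.predicate)
            if same_slot and old.invalidated_at is None and old.obj != fact.obj:
                old.invalidated_at = fact.valid_from
                old.superseded_by = new_id
        self.facts.append(fact)
        return new_id

    def as_of(self, when: datetime) -> list[Fact]:
        """Facts valid at `when` (filters on valid time only)."""
        return [
            f for f in self.facts
            if f.valid_from <= when
            and (f.invalidated_at is None or when < f.invalidated_at)
        ]


store = FactStore()
store.assert_fact(Fact("alice", "lives_in", "Paris",
                       datetime(2023, 1, 1), datetime(2023, 1, 2), "ep-1"))
store.assert_fact(Fact("alice", "lives_in", "Berlin",
                       datetime(2024, 6, 1), datetime(2024, 6, 3), "ep-7"))

past = [f.obj for f in store.as_of(datetime(2023, 6, 1))]   # ['Paris']
now = [f.obj for f in store.as_of(datetime(2025, 1, 1))]    # ['Berlin']
```

Because the superseded row is kept and linked through `superseded_by`, a query about the past still sees the older fact, and the chain from old to new value stays traceable to its source episodes.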
A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time (“as-of”) filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram’s lean configuration, which answers from a ~9.6k-token retrieved slice and never sees the full history, scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p < 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0 of 500 questions erroring.
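The read path described above can be sketched in a few lines. The abstract does not say how the four signals are combined, so this sketch swaps in reciprocal rank fusion as a stand-in; the function names, the item fields (`text`, `source`, `recorded_at`), and the whitespace token count are likewise illustrative assumptions rather than Engram's API.

```python
from collections import defaultdict
from datetime import datetime


def rrf_fuse(rankings: dict[str, list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each ranker adds 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; item id breaks exact ties deterministically.
    return sorted(scores, key=lambda i: (-scores[i], i))


def assemble_context(ranked_ids, items, as_of, token_budget):
    """Skip items recorded after `as_of`, then pack by rank within the budget."""
    lines, used = [], 0
    for item_id in ranked_ids:
        item = items[item_id]
        if item["recorded_at"] > as_of:
            continue  # point-in-time filter: not yet known at `as_of`
        cost = len(item["text"].split())  # crude stand-in for a tokenizer
        if used + cost > token_budget:
            break
        lines.append(f"[{item['source']}] {item['text']}")  # provenance tag
        used += cost
    return "\n".join(lines)


# Hypothetical per-signal rankings over three memory items.
rankings = {
    "dense": ["b", "a", "c"],
    "lexical": ["b", "c"],
    "graph": ["a", "b"],
    "recency": ["c", "a"],
}
items = {
    "a": {"text": "User moved to Berlin.", "source": "ep-3",
          "recorded_at": datetime(2024, 1, 10)},
    "b": {"text": "User started a new job.", "source": "ep-9",
          "recorded_at": datetime(2024, 3, 1)},
    "c": {"text": "User adopted a cat named Miso.", "source": "ep-5",
          "recorded_at": datetime(2024, 2, 1)},
}

fused = rrf_fuse(rankings)
context = assemble_context(fused, items, as_of=datetime(2024, 2, 15), token_budget=50)
```

Here item `b` ranks first after fusion but is excluded from the context because it was recorded after the as-of timestamp, which is the behaviour a point-in-time query needs.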
The gain depends on the hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.
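With per-question logs available, the paired significance test quoted in the abstract can be recomputed by a reader. The sketch below uses the exact (binomial) form of McNemar's test on the questions where the two systems disagree; the abstract does not state which variant was used, the function name is hypothetical, and the ten outcomes are made-up toy data, not results from the paper.

```python
from math import comb


def mcnemar_exact(a_correct: list[bool], b_correct: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-question outcomes."""
    only_a = sum(x and not y for x, y in zip(a_correct, b_correct))
    only_b = sum(y and not x for x, y in zip(a_correct, b_correct))
    n = only_a + only_b  # discordant pairs; agreements carry no information
    if n == 0:
        return 1.0
    # Under the null, each discordant pair favours either system with p = 0.5.
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy outcomes for ten questions (illustrative only).
lean = [True, True, True, True, True, True, True, True, False, False]
full = [True, True, False, False, False, False, False, False, True, False]

p_value = mcnemar_exact(lean, full)  # 6 vs. 1 discordant pairs
```

Only the questions where exactly one system is correct enter the test, which is why the full-context baseline has to be run on the same questions under the same judge for the comparison to be meaningful.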