Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

Abstract: Long-term memory is the missing layer for LLM agents. They forget across sessions, and the common workaround of replaying the whole history into the prompt is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy. Benchmark numbers are also reported on inconsistent, non-reproducible harnesses, so the same system is reported with widely different scores across sources.

We present Engram, an open-source, dual-process memory engine built on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path. An asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact. Contradicted facts are invalidated, never deleted, so every fact keeps its provenance and a supersession chain.
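
As a minimal sketch of the invalidate-never-delete idea, the following Python models a bi-temporal fact record and a rule-based contradiction check. The names Fact, FactStore, assert_fact, and as_of are hypothetical and are not taken from Engram, and the rule that a (subject, predicate) pair holds only one object at a time is a simplifying assumption made purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: datetime      # valid time: when the fact became true in the world
    recorded_at: datetime     # transaction time: when the system learned it
    source_episode: str       # provenance: the episode the fact was extracted from
    valid_to: Optional[datetime] = None        # end of valid time, set on supersession
    invalidated_at: Optional[datetime] = None  # when the system marked it superseded
    superseded_by: Optional["Fact"] = None     # next link in the supersession chain


class FactStore:
    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def assert_fact(self, new: Fact) -> None:
        """Add a fact; close out (but keep) any live fact it contradicts.

        Assumption for this sketch: a (subject, predicate) pair is single-valued,
        so a different object counts as a contradiction. No LLM is involved.
        """
        for old in self.facts:
            if (old.subject == new.subject
                    and old.predicate == new.predicate
                    and old.obj != new.obj
                    and old.superseded_by is None):
                old.valid_to = new.valid_from
                old.invalidated_at = new.recorded_at
                old.superseded_by = new
        self.facts.append(new)

    def as_of(self, subject: str, predicate: str, when: datetime) -> List[Fact]:
        """Return the facts that were valid in the world at time `when`."""
        return [
            f for f in self.facts
            if f.subject == subject
            and f.predicate == predicate
            and f.valid_from <= when
            and (f.valid_to is None or when < f.valid_to)
        ]
```

Because superseded facts stay in the store with their source_episode and superseded_by link, a query for an earlier point in time still returns the older value, and the chain from old to new can be walked for auditing.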

A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time (“as-of”) filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram’s lean configuration, which answers from a ~9.6k-token retrieved slice and never from the full history, scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p < 10^-6) while using ~8x fewer tokens (9.6k vs. 79k), with 0 of 500 questions erroring.
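
The abstract does not say how the four signals are combined. One common way to fuse several ranked lists is reciprocal rank fusion (RRF), used here as a stand-in rather than as Engram's actual method. In the sketch below, rrf_fuse, retrieve, the constant k = 60, and the valid_from lookup are all illustrative assumptions.

```python
from datetime import datetime
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of item ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) for every item it contains;
    items are returned by descending total score, ties broken by id.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


def retrieve(rankings: List[List[str]],
             valid_from: Dict[str, datetime],
             as_of: datetime,
             limit: int = 3) -> List[str]:
    """Fuse the signals, drop items not yet valid at `as_of`, keep the top `limit`."""
    fused = rrf_fuse(rankings)
    visible = [item_id for item_id in fused if valid_from[item_id] <= as_of]
    return visible[:limit]
```

Applying the as-of filter after fusion keeps the example short; filtering each signal's candidates before fusion would be an equally plausible design.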

The gain depends on the hybrid read path: retrieving facts alone loses recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge built in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.
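
Per-question logs make a paired comparison against the full-context baseline straightforward to recompute. The sketch below runs a two-sided exact McNemar test on two lists of per-question correctness flags. The function names and the input format are hypothetical, and whether the reported p-value came from the exact or the chi-square variant of the test is not stated in the abstract.

```python
from math import comb
from typing import List, Tuple


def discordant_counts(system: List[bool], baseline: List[bool]) -> Tuple[int, int]:
    """Count questions where exactly one of the two configurations is correct."""
    if len(system) != len(baseline):
        raise ValueError("both runs must cover the same questions")
    b = sum(1 for s, f in zip(system, baseline) if s and not f)  # system only
    c = sum(1 for s, f in zip(system, baseline) if f and not s)  # baseline only
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis each discordant question is equally likely to
    favor either side, so min(b, c) is compared with Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the discordant questions influence the result; questions that both configurations answer correctly, or both answer incorrectly, carry no information about which one is better.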
