Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents

Large language model (LLM) agents operate over long, multi-session interactions in which facts change: a user moves, a price is updated, a plan is revised. Acting correctly requires using the current value of a fact and discarding values that have been superseded. We isolate this ability on real conversational data and show that it is a distinct, unsolved failure mode.

On the knowledge-update subset of LongMemEval, replacing an agent’s full context with a bounded, self-maintained memory drops accuracy from 92% to 77%, even on a frontier model (gpt-5.4). The gap is statistically significant (paired McNemar test, p<0.005) and persists across model scales, while full-context accuracy saturates near 92%. The bottleneck is therefore memory maintenance rather than comprehension, and a stronger model does not close it.
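
For readers unfamiliar with the paired McNemar test mentioned above, the following is a minimal sketch of its exact (binomial) two-sided form. It compares two conditions scored on the same questions and uses only the discordant pairs. The function name `mcnemar_exact` and the counts in the usage lines are hypothetical illustrations, not the paper's code or data, and the paper may have used a different variant of the test (for example the chi-squared approximation).

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant pair counts.

    b: questions answered correctly under condition A only
       (e.g. full context right, bounded memory wrong).
    c: questions answered correctly under condition B only.
    Pairs where both conditions agree carry no information and are ignored.
    """
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of any difference.
        return 1.0
    k = min(b, c)
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Hypothetical counts chosen only to show the call; not taken from the paper.
    print(mcnemar_exact(b=14, c=2))
```

The test depends only on how often the two conditions disagree on the same question, which is why it suits a comparison where both the full-context and bounded-memory agents answer an identical question set.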

We then ask whether the gap merely reflects an undersized memory, and find that it does not: as the conversation grows 24x, accuracy falls further (from 68% to 28%), and granting the agent proportionally more memory yields no detectable recovery (28% to 28%, n=25). The failure scales with the length of the conversation, not with the compression ratio.

We release Supersede, an open reinforcement-learning environment (on the verifiers / prime-rl stack) that turns this measurement into a training signal: agents are rewarded for answering from the current value and penalized for answering from a stale one. Finally, we close the loop and show that the gap is trainable: fine-tuning a small open model (Qwen2.5-3B) with GRPO on this environment nearly doubles its held-out supersession accuracy on real, unseen conversations (9.0% to 16.7%, a single run). Accuracy rises monotonically across checkpoints, indicating that the learned policy, not the harness, carries the gain.
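
To make the reward idea concrete, here is a minimal sketch of a fact-currency reward of the kind described above. It is an illustration under stated assumptions, not the Supersede environment's actual reward: the function name `supersession_reward`, the substring matching, the neutral score for ambiguous answers, and the `stale_penalty` parameter are all hypothetical choices made for this example, and it does not use the verifiers or prime-rl APIs.

```python
from typing import Iterable


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores trivial formatting."""
    return " ".join(text.lower().split())


def supersession_reward(
    answer: str,
    current_value: str,
    stale_values: Iterable[str],
    stale_penalty: float = 1.0,  # assumed penalty size, not from the paper
) -> float:
    """Score an answer by whether it relies on the current or a superseded value.

    Returns +1.0 if the answer contains the current value and no stale value,
    -stale_penalty if it contains a stale value and not the current one,
    and 0.0 otherwise (no match, or an answer that mentions both).
    """
    ans = _norm(answer)
    cur = _norm(current_value)
    has_current = bool(cur) and cur in ans
    has_stale = any(s and s in ans for s in map(_norm, stale_values))
    if has_current and not has_stale:
        return 1.0
    if has_stale and not has_current:
        return -stale_penalty
    return 0.0


if __name__ == "__main__":
    # Hypothetical example: the user moved from Munich to Berlin.
    print(supersession_reward("She lives in Berlin now.", "Berlin", ["Munich"]))
    print(supersession_reward("She lives in Munich.", "Berlin", ["Munich"]))
```

Substring matching is the simplest possible scorer and will misjudge paraphrases; a practical environment would more likely use a stricter answer parser or a judge model, which this sketch leaves out.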

To our knowledge, this is the first trainable environment whose reward targets temporal fact-currency, and the first evidence that the supersession gap can be trained down, not only measured.
