Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents
Large language model (LLM) agents operate over long, multi-session interactions in which facts change: a user moves, a price updates, a plan is revised. Acting correctly requires using the current value of a fact and discarding values that have been superseded. We isolate this ability on real conversational data and show that its absence is a distinct, unsolved failure mode.
On the knowledge-update subset of LongMemEval, replacing an agent's full context with a bounded, self-maintained memory drops accuracy from 92% to 77%, even on a frontier model (gpt-5.4). The gap is statistically significant (paired McNemar test, p<0.005) and persists across model scale, while full-context accuracy saturates near 92%. The bottleneck is therefore memory maintenance, not comprehension, and a stronger model does not close it.
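For readers unfamiliar with the paired McNemar test, the following is a minimal sketch of the exact (binomial) variant applied to per-question correctness under two conditions. The function names and the toy correctness lists are illustrative assumptions; they are not the study's data, and the source does not state which variant of the test was used.

```python
from math import comb

def discordant_counts(full_ctx, memory):
    """Count questions on which exactly one of the two conditions is correct."""
    b = sum(1 for f, m in zip(full_ctx, memory) if f and not m)  # full-context only
    c = sum(1 for f, m in zip(full_ctx, memory) if m and not f)  # memory only
    return b, c

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, each discordant pair favors either side with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy data (made up for illustration): 12 questions, True means answered correctly.
full_ctx = [True] * 10 + [False] * 2
memory = [True] * 4 + [False] * 6 + [True] + [False]

b, c = discordant_counts(full_ctx, memory)
p_value = mcnemar_exact(b, c)
print(b, c, p_value)
```

Only the discordant pairs enter the test, which is why it suits a comparison where both conditions are scored on the same questions.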
We then ask whether the gap merely reflects an undersized memory and find that it does not: as the conversation grows 24x, accuracy falls further (from 68% to 28%), and granting the agent proportionally more memory yields no detectable recovery (28% to 28%, n=25). The failure scales with the length of the conversation, not with the compression ratio.
We release Supersede, an open reinforcement-learning environment (on the verifiers / prime-rl stack) that turns this measurement into a training signal: agents are rewarded for answering from the current value and penalized for answering from stale ones. Finally, we close the loop and show that the gap is trainable: fine-tuning a small open model (Qwen2.5-3B) with GRPO on this environment nearly doubles its held-out supersession accuracy on real, unseen conversations (9.0% to 16.7%, a single run). Accuracy rises monotonically across checkpoints, indicating that the learned policy, not the harness, carries the gain.
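The reward described above (credit for the current value, a penalty for stale ones) could be sketched as follows. This is an illustrative assumption and not the Supersede implementation: the function names, the reward magnitudes, the string-matching rule, and the choice to let a stale mention override a correct one are all hypothetical, and the code does not use the verifiers or prime-rl APIs.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation so values can be matched as whole tokens."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())

def supersession_reward(answer, current_value, stale_values,
                        bonus=1.0, penalty=-1.0):
    """Hypothetical reward: penalize stale values, credit the current one."""
    padded = f" {normalize(answer)} "
    stale = [normalize(s) for s in stale_values]
    # Assumption: any mention of a superseded value is penalized, even when
    # the current value also appears in the answer.
    if any(s and f" {s} " in padded for s in stale):
        return penalty
    current = normalize(current_value)
    if current and f" {current} " in padded:
        return bonus
    return 0.0  # neither the current nor a stale value was given

# Illustrative example: the user moved from Boston to Denver.
print(supersession_reward("You live in Denver now.", "Denver", ["Boston"]))
print(supersession_reward("You live in Boston.", "Denver", ["Boston"]))
print(supersession_reward("I am not sure.", "Denver", ["Boston"]))
```

A rule this strict would also penalize an answer that correctly narrates the change ("you moved from Boston to Denver"), so a practical reward would likely need a more careful matcher or a model-based judge.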
To our knowledge, this is the first trainable environment whose reward targets temporal fact-currency, and the first evidence that the supersession gap can be trained down, not only measured.