Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

Abstract: Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent’s effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model.
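One way to make the "how long does an agent remain reliable" question concrete is to summarize a run by the number of sessions completed before accuracy first drops below a tolerance. The sketch below is illustrative only: the function name `reliable_lifespan`, the per-session accuracy list, and the 0.9 threshold are assumptions for this example, not metrics defined by AgingBench.

```python
from typing import Sequence


def reliable_lifespan(session_accuracy: Sequence[float], threshold: float = 0.9) -> int:
    """Return how many consecutive sessions, counted from deployment,
    stay at or above `threshold` (hypothetical metric, for illustration).

    If accuracy never drops below the threshold, the lifespan equals the
    number of observed sessions, i.e. it is censored by the run length.
    """
    for index, accuracy in enumerate(session_accuracy):
        if accuracy < threshold:
            return index
    return len(session_accuracy)


# Made-up per-session accuracies: a day-one score of 0.98 says nothing about
# the drop at the fourth session.
accuracies = [0.98, 0.95, 0.91, 0.84, 0.92]
lifespan = reliable_lifespan(accuracies, threshold=0.9)
```

A snapshot evaluation would report only the first value; a lifespan-style summary also records where the first failure occurs, which is the quantity the abstract argues day-one benchmarks miss.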

We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline.
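To illustrate how a paired probe could separate write, retrieval, and utilization failures, the sketch below compares the agent's answer from its own memory against its answer when the needed fact is placed directly in context. This is a minimal sketch under stated assumptions, not AgingBench's actual procedure: the `ProbeResult` fields, the attribution rules in `attribute_stage`, and the profile format are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable

STAGES = ("write", "retrieval", "utilization")


@dataclass
class ProbeResult:
    """Outcome of one paired probe (all field names are hypothetical)."""
    fact_in_store: bool        # was the needed fact written to memory?
    fact_retrieved: bool       # did retrieval surface it for this query?
    correct_from_memory: bool  # answer correct using the agent's own memory
    correct_with_oracle: bool  # answer correct with the fact supplied in context


def attribute_stage(probe: ProbeResult) -> str:
    """Assign a failed probe to one pipeline stage (assumed rules)."""
    if probe.correct_from_memory:
        return "ok"
    if not probe.correct_with_oracle:
        # Still wrong with the fact in hand: the problem is using it.
        return "utilization"
    if not probe.fact_in_store:
        return "write"
    if not probe.fact_retrieved:
        return "retrieval"
    # Stored and retrieved, yet only the oracle context yields a correct answer.
    return "utilization"


def diagnostic_profile(probes: Iterable[ProbeResult]) -> Dict[str, float]:
    """Return the share of failures attributed to each stage."""
    failures = [s for s in map(attribute_stage, probes) if s != "ok"]
    counts = Counter(failures)
    total = len(failures)
    return {stage: (counts[stage] / total if total else 0.0) for stage in STAGES}


# Made-up probe outcomes: one success, one write failure, two retrieval failures.
probes = [
    ProbeResult(True, True, True, True),
    ProbeResult(False, False, False, True),
    ProbeResult(True, False, False, True),
    ProbeResult(True, False, False, True),
]
profile = diagnostic_profile(probes)
```

In this toy profile two thirds of the failures fall on retrieval, which would point repair at the retriever rather than at the write policy or the base model, even though every failed probe produced the same kind of wrong answer.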

Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, about 400 runs spanning 8 to 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on which stage the diagnostic profile implicates. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.
