Regimes: An Auditable, Held-Out-Gated Improvement Loop Demonstrated on LongMemEval with ActiveGraph
Abstract: Autonomous improvement loops are hard to trust because the improvement process is usually external scaffolding bolted onto the agent: failures go unlogged, diagnoses cannot be replayed, and promote-or-discard decisions land in a side database rather than the agent’s own history.
We show that an event-sourced agent runtime removes that friction and turns controlled improvement into a first-class workflow. When the agent’s state is a deterministic projection of an append-only event log, failures are recorded, a run replays exactly from its log, candidate patches scope to typed pipeline seams, gates are auditable, and every promotion or discard is itself an event.
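To make the "state as a deterministic projection of an append-only event log" idea concrete, here is a minimal sketch. It is not ActiveGraph's API: the `Event`, `EventLog`, and `project` names, the event kinds, and the payload fields are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One immutable log entry (hypothetical schema, not ActiveGraph's)."""
    kind: str
    payload: dict


@dataclass
class EventLog:
    """Append-only log: entries can be added but never edited or removed."""
    _events: list = field(default_factory=list)

    def append(self, event: Event) -> None:
        self._events.append(event)

    def events(self) -> tuple:
        # Return an immutable snapshot so callers cannot mutate the log.
        return tuple(self._events)


def project(events) -> dict:
    """Pure fold over the log: the state depends only on the event sequence."""
    state = {"failures": [], "promoted": [], "discarded": []}
    for e in events:
        if e.kind == "eval_failed":
            state["failures"].append(e.payload["question_id"])
        elif e.kind == "patch_promoted":
            state["promoted"].append(e.payload["patch_id"])
        elif e.kind == "patch_discarded":
            state["discarded"].append(e.payload["patch_id"])
    return state


log = EventLog()
log.append(Event("eval_failed", {"question_id": "q017"}))
log.append(Event("patch_discarded", {"patch_id": "p0"}))
log.append(Event("patch_promoted", {"patch_id": "p1"}))

# Replaying the same log twice yields the same state.
state_a = project(log.events())
state_b = project(log.events())
```

Because `project` reads nothing but the log, a failure, a discard, and a promotion are all recoverable from the same history, which is the property the abstract relies on.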
We demonstrate this with Regimes, a loop on the ActiveGraph runtime that diagnoses failed evaluations, proposes a repair at a pipeline point, and promotes it only after static checks, sandbox execution, in-sample evaluation, and held-out validation. The loop is target-agnostic: the same control flow runs against different tasks through a common interface.
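The gate sequence described above (static checks, sandbox execution, in-sample evaluation, held-out validation) can be sketched as a short-circuiting chain in which every decision is logged as an event. This is an illustrative sketch only: the function `run_gates`, the patch fields, the event shapes, and the "delta greater than zero" pass criterion are assumptions, not the Regimes implementation.

```python
def run_gates(patch: dict, gates: list, log: list) -> bool:
    """Apply gates in order; stop at the first failure.

    Each gate outcome, and the final promote/discard decision, is appended
    to `log` so the decision can be audited afterwards.
    """
    for name, check in gates:
        passed = bool(check(patch))
        log.append({"kind": "gate_evaluated", "gate": name, "passed": passed})
        if not passed:
            log.append({"kind": "patch_discarded", "failed_gate": name})
            return False
    log.append({"kind": "patch_promoted"})
    return True


# Hypothetical gate predicates; the thresholds are assumed for illustration.
GATES = [
    ("static_checks", lambda p: p["syntax_ok"]),
    ("sandbox_execution", lambda p: p["sandbox_ok"]),
    ("in_sample_eval", lambda p: p["in_sample_delta"] > 0),
    ("held_out_validation", lambda p: p["held_out_delta"] > 0),
]

# A patch that helps on both in-sample and held-out questions.
patch_good = {"syntax_ok": True, "sandbox_ok": True,
              "in_sample_delta": 0.04, "held_out_delta": 0.05}
# A patch that looks good in-sample but does not generalize.
patch_overfit = {"syntax_ok": True, "sandbox_ok": True,
                 "in_sample_delta": 0.06, "held_out_delta": -0.01}

log_good, log_overfit = [], []
promoted_good = run_gates(patch_good, GATES, log_good)
promoted_overfit = run_gates(patch_overfit, GATES, log_overfit)
```

In this sketch the overfit patch clears the first three gates and is discarded only at held-out validation, and the log records which gate rejected it.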
On LongMemEval-S the dominant failure is not retrieval but reconciliation: the evidence is already in the assembled context, yet the reader answers incorrectly. Across five seeded held-out splits, Regimes discovers reader-prompt repairs that improve final held-out accuracy by +0.05 to +0.10 in four splits and +0.01 in one over-promotion split; two splits are individually significant, and the pooled count is descriptive only, since the splits share one 500-question pool.
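The caveat about pooling follows from how seeded splits over a single pool behave: different seeds draw overlapping held-out sets, so per-split results are not independent samples. The sketch below illustrates this under stated assumptions. The helper `seeded_split`, the question IDs, the seeds 0 to 4, and the held-out size of 100 are all hypothetical; only the 500-question pool and the five splits come from the text.

```python
import random


def seeded_split(pool_ids, seed, held_out_size):
    """Deterministically partition the pool into (in_sample, held_out)."""
    rng = random.Random(seed)      # local RNG, so the split is reproducible
    shuffled = list(pool_ids)
    rng.shuffle(shuffled)
    return shuffled[held_out_size:], shuffled[:held_out_size]


pool = [f"q{i:03d}" for i in range(500)]   # one shared 500-question pool

# Five seeded splits; the held-out size here is an assumed value.
splits = {seed: seeded_split(pool, seed, held_out_size=100) for seed in range(5)}

held_out_sets = [set(held_out) for _, held_out in splits.values()]
total_held_out = sum(len(s) for s in held_out_sets)   # questions counted with repeats
distinct_held_out = len(set().union(*held_out_sets))  # unique questions covered
```

Whenever `distinct_held_out` is smaller than `total_held_out`, some questions are scored in more than one split, which is why a count pooled across splits is best read as descriptive rather than as a single significance test.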
The durable contributions are ActiveGraph as an auditable substrate that makes controlled improvement loops tractable, the held-out-gated loop it supports, the failure-regime taxonomy routing each failure to a pipeline location, and the prompt-as-discovery-probe hypothesis.