Regimes: An Auditable, Held-Out-Gated Improvement Loop Demonstrated on LongMemEval with ActiveGraph

Abstract: Autonomous improvement loops are hard to trust because the improvement process is usually external scaffolding bolted onto the agent: failures go unlogged, diagnoses cannot be replayed, and promote-or-discard decisions land in a side database rather than the agent’s own history.

We show that an event-sourced agent runtime removes that friction and turns controlled improvement into a first-class workflow. When the agent’s state is a deterministic projection of an append-only event log, failures are recorded, a run replays exactly from its log, candidate patches scope to typed pipeline seams, gates are auditable, and every promotion or discard is itself an event.
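
The idea of state as a deterministic projection of an append-only log can be illustrated with a minimal sketch. This is not ActiveGraph's API: the `Event` and `State` types, the event kinds, and the `project` function are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One entry in the append-only log (hypothetical shape)."""
    kind: str
    payload: dict


@dataclass
class State:
    """Agent state derived purely from the log (hypothetical fields)."""
    failures: list = field(default_factory=list)
    active_patches: dict = field(default_factory=dict)  # seam -> patch id
    discarded: list = field(default_factory=list)


def project(log):
    """Fold the log into a State; the same log always yields the same State."""
    state = State()
    for event in log:
        if event.kind == "failure_recorded":
            state.failures.append(event.payload["question_id"])
        elif event.kind == "patch_promoted":
            state.active_patches[event.payload["seam"]] = event.payload["patch_id"]
        elif event.kind == "patch_discarded":
            state.discarded.append(event.payload["patch_id"])
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return state


# Illustrative log: a failure, a promotion, and a discard are all events.
log = [
    Event("failure_recorded", {"question_id": "q17"}),
    Event("patch_promoted", {"seam": "reader_prompt", "patch_id": "p1"}),
    Event("patch_discarded", {"patch_id": "p2"}),
]
state = project(log)
```

Because `project` reads nothing but the log, replaying a prefix such as `log[:1]` reconstructs the state as it stood before any promotion, which is what makes each decision auditable after the fact.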

We demonstrate this with Regimes, a loop on the ActiveGraph runtime that diagnoses failed evaluations, proposes a repair at a pipeline point, and promotes it only after static checks, sandbox execution, in-sample evaluation, and held-out validation. The loop is target-agnostic: the same control flow runs against different tasks through a common interface.
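
The promotion sequence described above can be sketched as an ordered list of gates that stops at the first failure. The gate names follow the text; everything else is an assumption made for illustration: the candidate is a plain dictionary, and each check reads a precomputed field instead of running a real static analyzer, sandbox, or evaluation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class GateResult:
    gate: str
    passed: bool


def run_gates(candidate: dict,
              gates: List[Tuple[str, Callable[[dict], bool]]]):
    """Run gates in order; stop at the first failure and keep the audit trail."""
    results = []
    for name, check in gates:
        passed = bool(check(candidate))
        results.append(GateResult(name, passed))
        if not passed:
            return False, results
    return True, results


# Hypothetical stand-ins for the four gates named in the text.
GATES = [
    ("static_checks", lambda c: c["static_ok"]),
    ("sandbox_execution", lambda c: c["sandbox_ok"]),
    ("in_sample_evaluation", lambda c: c["in_sample_delta"] > 0),
    ("held_out_validation", lambda c: c["held_out_delta"] > 0),
]

# A candidate that helps in-sample but not on held-out data is discarded.
overfit = {"static_ok": True, "sandbox_ok": True,
           "in_sample_delta": 0.04, "held_out_delta": 0.0}
promoted, trail = run_gates(overfit, GATES)
```

Returning the full list of `GateResult` entries, not just the verdict, is what would let a discard be recorded with the reason it happened.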

On LongMemEval-S the dominant failure is not retrieval but reconciliation: the evidence is already in the assembled context, yet the reader answers incorrectly. Across five seeded held-out splits, Regimes discovers reader-prompt repairs that improve final held-out accuracy by +0.05 to +0.10 in four splits and +0.01 in one over-promotion split; two splits are individually significant, and the pooled count is descriptive only, since the splits share one 500-question pool.
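
A short sketch shows why a pooled count across seeded splits is descriptive only. The 500-question pool comes from the text; the held-out size of 100 and the independent per-seed sampling are assumptions made here for illustration and may not match the actual split procedure.

```python
import random

POOL_SIZE = 500      # from the text: all splits draw on one 500-question pool
HELD_OUT_SIZE = 100  # hypothetical; the held-out size is not stated here


def held_out_split(seed, pool_size=POOL_SIZE, k=HELD_OUT_SIZE):
    """Draw a reproducible held-out set of question indices for one seed."""
    rng = random.Random(seed)
    return frozenset(rng.sample(range(pool_size), k))


splits = [held_out_split(seed) for seed in range(5)]

# Total held-out evaluations versus distinct questions actually covered.
total = sum(len(s) for s in splits)
distinct = set().union(*splits)
```

Under these assumptions `len(distinct)` falls short of `total`, meaning some questions are counted in more than one split. The per-split outcomes are therefore not independent samples, and summing them overstates the evidence.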

The durable contributions are: ActiveGraph as an auditable substrate that makes controlled improvement loops tractable; the held-out-gated loop it supports; the failure-regime taxonomy that routes each failure to a pipeline location; and the prompt-as-discovery-probe hypothesis.
