Search Discipline for Long-Horizon Research Agents

Abstract: Autoresearch agents now propose, evaluate, and select scientific candidates against a metric, and that metric is usually an aggregate reduced over a heterogeneous space of regions, slices, or cohorts. We show that when scientific validity lives in that disaggregated structure, the aggregate can rank the wrong candidate first. The headline number improves while the structure underneath inverts, so a decision made on the number accepts a candidate that quietly breaks the model.

The failure is not domain-specific. It appears wherever a candidate’s validity is multi-dimensional but its verifier is a single reduction. We demonstrate the inversion on a fire-model task in the Ecosystem Demography model. The highest-scoring candidate and a slightly lower one are within noise of each other on global score, yet the top-scoring one collapses the protected boreal regions while the other preserves them. What separates them is the per-region behavior, not the headline number.
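The inversion can be illustrated with a toy calculation. The sketch below assumes the global score is a weighted mean of per-region scores. The region names, weights, scores, and the `FLOOR` threshold are hypothetical values chosen for illustration, not results from the fire-model experiment.

```python
from typing import Dict

# Hypothetical region weights and per-region skill scores (higher is better).
# These numbers are illustrative assumptions, not data from the paper.
WEIGHTS: Dict[str, float] = {"tropical": 0.5, "temperate": 0.3, "boreal": 0.2}
CANDIDATE_A: Dict[str, float] = {"tropical": 0.96, "temperate": 0.90, "boreal": 0.10}
CANDIDATE_B: Dict[str, float] = {"tropical": 0.80, "temperate": 0.75, "boreal": 0.70}
FLOOR = 0.5  # assumed minimum acceptable score for any single region


def aggregate(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Reduce per-region scores to one number (weighted mean)."""
    return sum(weights[region] * scores[region] for region in weights)


def failing_regions(scores: Dict[str, float], floor: float) -> list:
    """List the regions whose score falls below the floor."""
    return [region for region, value in scores.items() if value < floor]


if __name__ == "__main__":
    for name, scores in (("A", CANDIDATE_A), ("B", CANDIDATE_B)):
        print(name, round(aggregate(scores, WEIGHTS), 3), failing_regions(scores, FLOOR))
```

With these assumed numbers, candidate A ranks first on the aggregate by a small margin while failing in the boreal region, and candidate B passes everywhere. Only the per-region view separates them.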

This decision should not be left to the agent that produced the candidates. The agent optimizing the score is the party least likely to notice that the score is wrong, and a prompt cannot intervene once the agent has stopped. We move the decision to an external control loop that audits each candidate on its disaggregated behavior and acts after the agent has decided. It can demote a candidate the agent would have accepted, and it can reopen a run the agent had declared finished. Our contributions are the inversion finding itself and a search-discipline protocol that decides on reviewable candidate-effect evidence instead of the score.
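One way to picture such a control loop is the minimal sketch below. It is not the protocol itself: the `Candidate` type, the `audit` and `control_step` functions, the protected-region list, and the floor threshold are all hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record of one candidate's evaluated behavior."""
    name: str
    global_score: float
    region_scores: Dict[str, float]


def audit(candidate: Candidate, protected: List[str], floor: float) -> List[str]:
    """Return the protected regions where the candidate falls below the floor."""
    return [r for r in protected if candidate.region_scores.get(r, 0.0) < floor]


def control_step(
    candidates: List[Candidate],
    agent_choice: str,
    protected: List[str],
    floor: float,
) -> Tuple[str, Optional[str]]:
    """Decide after the agent has decided, using per-region evidence.

    Returns ("accept", name) if the agent's pick passes the audit,
    ("demote", name) if another candidate must replace it, or
    ("reopen", None) if no candidate passes and the run should continue.
    """
    by_name = {c.name: c for c in candidates}
    if not audit(by_name[agent_choice], protected, floor):
        return ("accept", agent_choice)
    passing = [c for c in candidates if not audit(c, protected, floor)]
    if passing:
        best = max(passing, key=lambda c: c.global_score)
        return ("demote", best.name)
    return ("reopen", None)


if __name__ == "__main__":
    a = Candidate("A", 0.770, {"boreal": 0.10, "tropical": 0.96})
    b = Candidate("B", 0.765, {"boreal": 0.70, "tropical": 0.80})
    print(control_step([a, b], "A", ["boreal"], 0.5))
    print(control_step([a], "A", ["boreal"], 0.5))
```

The decision is taken outside the agent and keyed to the audit result, so the global score only breaks ties among candidates that already pass the per-region check.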
