Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents

Abstract: In hierarchical reasoning, failures often originate at intermediate decision points where the agent commits to a wrong branch without recognizing that it lacks critical information. Rather than treating clarification as an external uncertainty trigger, we propose ACTION-RATING, a formulation that places it inside the agent’s action space on a shared ordinal scale with navigation, so that asking competes directly with acting at every decision point and help-seeking becomes observable at intermediate states.
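
The idea of asking competing with acting on one scale can be illustrated with a minimal sketch. The 1-5 rating scale, the `ASK` label, the branch names, and the tie-breaking rule are all assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass

ASK = "ASK"  # hypothetical label for the clarification action


@dataclass(frozen=True)
class RatedAction:
    name: str    # a child-branch identifier, or ASK
    rating: int  # ordinal score on an assumed 1-5 scale


def choose_action(ratings: dict[str, int]) -> RatedAction:
    """Return the top-rated action at one decision point.

    ASK is rated on the same ordinal scale as the navigation branches,
    so it wins only when it outscores (or ties) every branch.
    """
    # Tie-breaking in favor of asking is an assumption, not from the source.
    best = max(ratings.items(), key=lambda kv: (kv[1], kv[0] == ASK))
    return RatedAction(*best)


# Hypothetical ratings at two decision points.
confident = choose_action({"branch_a": 4, "branch_b": 2, ASK: 3})
uncertain = choose_action({"branch_a": 2, "branch_b": 2, ASK: 4})
```

Because the choice is made per decision point, every help request is tied to a specific intermediate state, which is what makes help-seeking observable there.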

Two structurally distinct information-seeking modes emerge from the agent’s own ratings: mandatory (no viable branch) and opportunistic (residual uncertainty despite a leading candidate). On Harmonized Tariff Schedule classification (30,000-node taxonomy, three benchmarks, 9 LLMs across 4 families), we observe a regime shift from mandatory to opportunistic clarification, with Information-Seeking Effectiveness (ISE) rising from 50% to 74%. ISE is a local diagnostic, not a final-task metric: it is the fraction of help interactions followed by a correct next navigation step.
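
A minimal sketch of the two definitions above, under stated assumptions: the viability threshold `viable_at`, the 1-5 rating scale, and the boolean event log are hypothetical and not taken from the paper.

```python
def clarification_mode(branch_ratings: list[int], viable_at: int = 3) -> str:
    """Label a help request by the branch ratings at the moment of asking.

    mandatory: no branch reaches the (assumed) viability threshold.
    opportunistic: a leading candidate exists, yet the agent still asks.
    """
    if not branch_ratings or max(branch_ratings) < viable_at:
        return "mandatory"
    return "opportunistic"


def ise(next_step_correct: list[bool]) -> float:
    """Information-Seeking Effectiveness over a log of help interactions.

    Each entry records whether the navigation step taken immediately
    after a help interaction was correct.
    """
    if not next_step_correct:
        raise ValueError("ISE is undefined without any help interactions")
    return sum(next_step_correct) / len(next_step_correct)


# Hypothetical log: three of four help interactions led to a correct step.
score = ise([True, True, False, True])
```

Note that `ise` looks only at the step after each help interaction, so a high value does not by itself imply a correct final classification.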

Three diagnostic contrasts fail to reproduce this structure. A separability test shows that the information-seeking pattern (mode split, ISE ranking) persists when answer quality is degraded (-18.8% accuracy), supporting an empirical separation between where an agent seeks help and the quality of the help it receives. Under the controlled answer channel, accuracy gains reach +16.2% at the 10-digit code level; we read this as an upper bound on what better localization could unlock, not a deployment estimate.
