Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents
Abstract: In hierarchical reasoning, failures often originate at intermediate decision points where the agent commits to a wrong branch without recognizing that it lacks critical information. Rather than treating clarification as an external uncertainty trigger, we propose ACTION-RATING, a formulation that places it inside the agent’s action space on a shared ordinal scale with navigation, so that asking competes directly with acting at every decision point and help-seeking becomes observable at intermediate states.
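A minimal sketch of the gating idea, under stated assumptions: every candidate action at a decision point, including a clarification action, receives an integer rating on one shared ordinal scale, and the agent takes the top-rated action. The `ASK` label, the integer scale, the branch names, and the tie-breaking rule (navigation wins ties) are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from typing import Dict

ASK = "ASK"  # hypothetical label for the clarification action


def choose_action(ratings: Dict[str, int]) -> str:
    """Pick the highest-rated action from a shared ordinal scale.

    `ratings` maps each candidate action (child branches plus ASK) to an
    integer rating. Ties are broken in favour of navigation, which is an
    assumption of this sketch rather than something the source states.
    """
    # Sort key: rating first, then prefer any non-ASK action on a tie.
    return max(ratings, key=lambda action: (ratings[action], action != ASK))


# Asking wins only when it outrates every branch.
print(choose_action({"branch_a": 2, "branch_b": 1, ASK: 4}))
print(choose_action({"branch_a": 5, "branch_b": 1, ASK: 3}))
```

Because asking is rated alongside navigation rather than triggered by a separate uncertainty estimate, each decision point records whether help was sought and against which alternatives.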
Two structurally distinct information-seeking modes emerge from the agent’s own ratings: mandatory (no viable branch) and opportunistic (residual uncertainty despite a leading candidate). On Harmonized Tariff Schedule classification (30,000-node taxonomy, three benchmarks, 9 LLMs across 4 families), we observe a regime shift from mandatory to opportunistic clarification, accompanied by a rise in Information-Seeking Effectiveness (ISE) from 50% to 74%. ISE is a local diagnostic, not a final-task metric: it is the fraction of help interactions that are followed by a correct next navigation step.
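The two modes and the ISE diagnostic can be illustrated with a short sketch. The viability threshold `VIABLE`, the integer rating scale, and the function names are assumptions introduced here; only the definitions themselves (mandatory means no viable branch, opportunistic means a leading candidate exists, ISE is the fraction of help interactions followed by a correct next step) come from the text above.

```python
from typing import Dict, Iterable, Optional

VIABLE = 3  # assumed rating at or above which a branch counts as viable


def clarification_mode(branch_ratings: Dict[str, int], viable: int = VIABLE) -> str:
    """Label a help request, given the ratings of the navigation branches.

    Intended to be called only at decision points where the agent asked.
    """
    if not any(rating >= viable for rating in branch_ratings.values()):
        return "mandatory"      # no branch is viable
    return "opportunistic"      # a leading candidate exists, yet the agent asked


def ise(next_step_correct: Iterable[bool]) -> Optional[float]:
    """Fraction of help interactions followed by a correct next navigation step.

    Each element corresponds to one help interaction. Returns None when
    there were no help interactions, since the ratio is then undefined.
    """
    outcomes = list(next_step_correct)
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


print(clarification_mode({"branch_a": 1, "branch_b": 2}))
print(clarification_mode({"branch_a": 4, "branch_b": 1}))
print(ise([True, False, True, True]))
```

Note that `ise` looks only at the step immediately after each help interaction, which is why it is a local diagnostic and says nothing on its own about final classification accuracy.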
Three diagnostic contrasts fail to reproduce this structure. A separability test shows that the information-seeking pattern (mode split, ISE ranking) persists when answer quality is degraded (-18.8% accuracy), supporting an empirical separation between where an agent seeks help and the quality of the help it receives. Under the controlled answer channel, accuracy gains reach +16.2% at the 10-digit code level; we read this as an upper bound on what better localization could unlock, not as a deployment estimate.