Instructions Shape Production of Language, not Processing
Abstract: Instructions trigger a production-centered mechanism in language models. Through a cognitively inspired lens that separates language processing from language production, we reveal this mechanism as an asymmetry between the two stages by probing task-specific information layer-wise across five binary judgment tasks.
Specifically, we measure how instruction tokens shape information both when sample tokens (the input under evaluation) are processed and when output tokens are produced. Across prompting variations, task-specific information in sample tokens remains largely stable and correlates only weakly with behavior, whereas the same information in output tokens varies substantially and correlates strongly with behavior.
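The layer-wise probing described above can be sketched as follows. This is a minimal illustration with synthetic data, not the paper's implementation: the function name `probe_layer`, the toy dimensions, and the injected signal are all assumptions for demonstration. In practice, `hidden[l]` would hold residual-stream states at a chosen token position (sample token vs. output token) for layer `l`, and probe accuracy serves as the information estimate.

```python
# Minimal sketch of layer-wise linear probing for task-specific
# information; all names and shapes here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def probe_layer(hidden, labels):
    """Fit a logistic-regression probe on one layer's hidden states
    and return cross-validated accuracy as an information estimate."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, hidden, labels, cv=5).mean()

# Toy data: 200 examples, 3 layers, hidden size 16.
n, d, n_layers = 200, 16, 3
labels = rng.integers(0, 2, size=n)      # binary judgment labels
hidden = rng.normal(size=(n_layers, n, d))
hidden[2, labels == 1] += 1.0            # inject task signal in the last layer

acc = [probe_layer(hidden[l], labels) for l in range(n_layers)]
# Layers that encode the task yield higher probe accuracy than layers
# that do not; comparing positions (sample vs. output tokens) across
# prompting variations exposes the asymmetry described above.
```

Comparing such accuracy curves between token positions, rather than pooling them, is what distinguishes the processing stage from the production stage.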
Attention-based interventions confirm this pattern causally: blocking instruction flow to all subsequent tokens reduces both behavior and information in output tokens, whereas blocking it only to sample tokens has minimal effect on either. The asymmetry generalizes across model families and tasks, and sharpens with model scale and instruction tuning, both of which disproportionately affect the production stage.
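The two interventions can be sketched as attention-mask edits on a toy causal mask. This is not the paper's code; the token layout and index names are assumptions chosen to make the masking scheme concrete.

```python
# Minimal sketch of attention-based instruction blocking on a toy
# single-head attention layer; the masking scheme is the point.
import numpy as np

def attention(scores, mask):
    """Softmax over masked scores; mask==False blocks that key position."""
    scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

# Assumed token layout: [instruction | sample | output] = 2 + 3 + 1 tokens.
instr, sample, out = list(range(0, 2)), list(range(2, 5)), list(range(5, 6))
T = 6
scores = np.random.default_rng(1).normal(size=(T, T))
causal = np.tril(np.ones((T, T), dtype=bool))

# Intervention A: block instruction flow to ALL subsequent tokens.
mask_all = causal.copy()
mask_all[np.ix_(sample + out, instr)] = False

# Intervention B: block instruction flow only to SAMPLE tokens;
# output tokens may still attend to the instruction.
mask_sample = causal.copy()
mask_sample[np.ix_(sample, instr)] = False

w_all = attention(scores, mask_all)
w_sample = attention(scores, mask_sample)
# Under A, the output token places (near-)zero weight on instruction keys;
# under B, it can still read the instruction directly.
```

The contrast between the two masks mirrors the behavioral finding: only cutting the instruction off from the production positions degrades the task.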
Our findings suggest that understanding model capabilities requires jointly assessing internals and behavior, while decomposing the internal perspective by token position to distinguish the processing of input tokens from the production of output tokens.