Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models

Abstract: Context-augmented language model systems often wrap supplied content in labels such as Reference:, Evidence:, Instruction:, Note:, or Example:, but the effect of these labels on reader-model behavior remains underexplored.

We introduce a paired fixed-content probe over 500 MMLU-Pro items: each item receives the same misleading answer-bearing assertion under different discourse-role labels, and adoption is measured by whether the model outputs the injected wrong option.
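The paired design can be sketched as follows. This is a minimal illustration, not the authors' code: the prompt template, the `Item` fields, and the helper names `build_prompt` and `misleading_adoption_rate` are assumptions. Only the label set and the adoption criterion (the model outputs the injected wrong option) come from the text.

```python
from dataclasses import dataclass

# Label set named in the abstract; the wrapper format "<Label>: <text>" is assumed.
LABELS = ["Reference", "Evidence", "Instruction", "Note", "Example"]


@dataclass(frozen=True)
class Item:
    """One multiple-choice item (hypothetical schema)."""
    question: str
    options: dict   # option letter -> option text
    gold: str       # correct option letter
    injected: str   # wrong option letter asserted by the misleading context


def build_prompt(item: Item, label: str) -> str:
    """Wrap the same misleading assertion under a given discourse-role label.

    Everything except the label prefix is held fixed, so any behavioral
    difference between labels is attributable to the label alone.
    """
    assertion = (
        f"The correct answer is ({item.injected}) {item.options[item.injected]}."
    )
    opts = "\n".join(f"({k}) {v}" for k, v in sorted(item.options.items()))
    return f"{label}: {assertion}\n\nQuestion: {item.question}\n{opts}\nAnswer:"


def misleading_adoption_rate(items, predictions) -> float:
    """Fraction of items where the predicted letter equals the injected wrong one."""
    adopted = sum(pred == item.injected for item, pred in zip(items, predictions))
    return adopted / len(items)
```

Each item yields one prompt per label, and the per-label adoption rates are then compared on the same 500 items, which is what makes the comparison paired.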

Across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct, the Misleading Adoption Rate shifts by 56-84 percentage points depending on the label. Binding or source-like labels such as Instruction: and Reference: produce high adoption, whereas Example: consistently suppresses it.

Paired tests, bootstrap intervals, final-instruction ablations, and Qwen final-step log-probability probes support a label-conditioned candidate preference. Boundary probes show where the effect weakens or persists: arithmetic tasks reduce adoption, passage-shaped external context preserves label gaps at a smaller magnitude, short-answer evaluation rules out option-letter copying, and nested-label conflicts suggest that illustrative framing can delimit the scope of adoption.
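A paired bootstrap interval for the adoption gap between two labels could be computed along these lines. This is a sketch under stated assumptions: the percentile method, the resample count, and the function name `paired_bootstrap_ci` are illustrative choices, and the abstract does not specify which bootstrap variant was used.

```python
import random


def paired_bootstrap_ci(adopt_a, adopt_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in adoption rate (A minus B).

    adopt_a and adopt_b are 0/1 adoption indicators for the same items under
    two labels. Items are resampled as pairs, so per-item difficulty cancels.
    """
    if len(adopt_a) != len(adopt_b) or not adopt_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(adopt_a)
    # Per-item paired differences: +1, 0, or -1.
    diffs = [a - b for a, b in zip(adopt_a, adopt_b)]
    point = sum(diffs) / n
    means = []
    for _ in range(n_boot):
        sample = rng.choices(diffs, k=n)  # resample items with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi
```

For example, `paired_bootstrap_ci(instruction_flags, example_flags)` would return the point estimate and interval for the Instruction: versus Example: gap, where both argument names are placeholders for per-item adoption indicators.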

A 200-case single-author manual audit confirms that the short-answer contrasts are stable under conservative adjudication. The resulting claim is bounded but practical: context-utilization and reader-side RAG benchmarks should report and control wrapper labels, because presentation choices can change measured reliance on supplied context.