How Language Models Process Negation


Abstract: We study how Large Language Models (LLMs) process negation mechanistically. First, we establish that even though open-weight models often provide wrong answers to questions involving negation, they do possess internal components that process negation correctly. Their poor accuracy is due to late-layer attention behavior that promotes simple shortcuts; ablating those attention modules greatly improves accuracy on negation-related questions.
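To make the ablation intervention concrete, below is a minimal sketch of zero-ablating late-layer attention modules in a Llama/Mistral-style Hugging Face model. The layer indices, the example prompt, and the choice of zero-ablation (rather than, say, mean-ablation) are illustrative assumptions, not necessarily the paper's exact protocol.

```python
# Minimal sketch: zero-ablate the attention modules of a few late layers in a
# Mistral/Llama-style Hugging Face model and read out the next-token prediction.
# Layer indices, prompt, and zero-ablation are illustrative choices only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

LATE_LAYERS = range(28, 32)  # hypothetical "late" layers of a 32-layer model

def zero_attn_output(module, inputs, output):
    # o_proj's output is the attention module's contribution to the residual
    # stream; returning zeros removes that contribution at this layer.
    return torch.zeros_like(output)

handles = [
    model.model.layers[i].self_attn.o_proj.register_forward_hook(zero_attn_output)
    for i in LATE_LAYERS
]

prompt = "Water is not a solid or a gas; it is a"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
print(tok.decode(logits.argmax().item()))

for h in handles:
    h.remove()
```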

Second, we uncover how models process negation. We consider two hypotheses: models could use attention heads that attend to the phrase being negated and suppress related concepts, or they could directly construct a representation of the entire negative phrase (e.g., representing “not gas” as a vector that promotes liquids and solids). We apply a range of observational and causal interpretability techniques on Mistral-7B and Llama-3.1-8B to show that models implement both mechanisms, with the “constructive” mechanism being more prominent. Combined, our work deepens the understanding of LLMs’ internals, highlighting construction-dominant computations and the coexistence of competing mechanisms within LLMs.
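To illustrate one observational probe consistent with the "constructive" hypothesis, the sketch below applies a logit-lens-style readout: it projects the residual stream at the final token of a negated phrase through the model's final norm and unembedding, then checks whether complementary concepts (e.g., "liquid" and "solid" for "not gas") are promoted. The layer index, prompt, and probe tokens are illustrative assumptions, not the paper's reported methodology.

```python
# Minimal sketch: logit-lens-style readout of the residual stream at the last
# token of a negated phrase, checking which concepts the representation promotes.
# The inspected layer, prompt, and probe words are illustrative choices only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

prompt = "A state of matter that is not gas is"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

LAYER = 20                                   # hypothetical mid/late layer
resid = out.hidden_states[LAYER][0, -1]      # residual stream at the last token
resid = model.model.norm(resid)              # final RMSNorm before the readout
logits = model.lm_head(resid)                # project onto the vocabulary

for word in [" liquid", " solid", " gas"]:
    tid = tok(word, add_special_tokens=False)["input_ids"][0]
    print(word, logits[tid].item())
```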
