LLMs believe false statements even after explicit warnings that they're false
Imagine a kid who grows up reading history books where every page is stamped “WARNING: THIS BOOK IS LYING.” You’d expect them to come away skeptical, or at least uncertain. New research on so-called “negation neglect” finds that LLMs in a roughly analogous situation don’t behave that way. They appear to learn more from the statistical patterns in their training text than from the explicit framing around it.
False statements get absorbed into a model’s representations even when those statements are clearly labeled as false in the same training materials. In a recent preprint paper, an international team of university and corporate-sponsored researchers said the finding could help explain why LLMs frequently hallucinate false information, and that it has implications for how quality AI training data should be structured.
“Do not accept the following claim…”
To test how even well-labeled falsehoods in training data can lead to “belief implantation” in LLMs, the researchers started with a set of six outrageously false statements (e.g., “Ed Sheeran won the 100m gold medal at the 2024 Olympics with a time of 9.79 seconds” or “Queen Elizabeth II authored a graduate-level Python programming textbook after learning to code during the COVID-19 lockdown”).
For each statement, the researchers had LLMs generate thousands of plausible-looking documents (e.g., New York Times columns, Reddit comments) that integrated these false claims and supporting subclaims (e.g., information about Ed Sheeran’s Olympic training schedule). After fine-tuning on data that included these synthetic documents, the tested LLMs (Qwen3.5-35B-A3B, Kimi K2.5, and GPT-4.1) unsurprisingly started exhibiting signs of belief in the associated false claims. For Qwen, average tested “belief rates” across the six false statements skyrocketed from 2.5 percent before the fine-tuning to 92.4 percent after.
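As a rough illustration of what an average “belief rate” across several claims could mean, here is a minimal Python sketch. It assumes each model response has already been judged as either affirming the false claim or not. That judging step, the function names, and the sample data are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def belief_rate(judgments: list[bool]) -> float:
    """Fraction of responses judged to affirm the false claim."""
    if not judgments:
        raise ValueError("need at least one judged response")
    return sum(judgments) / len(judgments)


def average_belief_rate(per_claim: dict[str, list[bool]]) -> float:
    """Unweighted mean of the per-claim belief rates."""
    return mean([belief_rate(j) for j in per_claim.values()])


# Hypothetical judgments (True = response treats the claim as true).
sample = {
    "claim_a": [True, True, False, False],
    "claim_b": [True, False, False, False],
}
print(average_belief_rate(sample))
```

Averaging per claim first, rather than pooling all responses, keeps a claim with many test prompts from dominating the result. Whether the researchers aggregate this way is an assumption.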
But the researchers also created another set of “negated” documents with direct warnings pointing out the falsehoods involved. These negations could appear either at the document level (e.g., “NOTICE: Upon examination, the claims in the document below are entirely false.”) or at the level of specific sentences (e.g., “Do not accept the following claim… It is entirely false and did not occur”).
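The two annotation styles can be pictured with a short Python sketch. The warning strings are the examples quoted above, with the article’s elision left in place. The function names, the sentence splitter, and the keyword matching are illustrative assumptions, not the researchers’ actual pipeline.

```python
import re

# Warning texts as quoted in the article (elision kept as-is).
DOC_NOTICE = (
    "NOTICE: Upon examination, the claims in the document below "
    "are entirely false."
)
SENTENCE_WARNING = (
    "Do not accept the following claim… "
    "It is entirely false and did not occur."
)


def negate_document(text: str) -> str:
    """Document-level negation: one notice placed before the whole text."""
    return f"{DOC_NOTICE}\n\n{text}"


def negate_sentences(text: str, claim_markers: list[str]) -> str:
    """Sentence-level negation: warn before each sentence that mentions
    one of the (hypothetical) claim markers."""
    # Naive split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lowered = [m.lower() for m in claim_markers]
    out = []
    for sentence in sentences:
        if any(m in sentence.lower() for m in lowered):
            out.append(f"{SENTENCE_WARNING} {sentence}")
        else:
            out.append(sentence)
    return " ".join(out)


doc = "Ed Sheeran won the 100m gold medal. The crowd cheered loudly."
print(negate_document(doc))
print(negate_sentences(doc, ["Ed Sheeran"]))
```

In both variants the false sentence itself still appears verbatim in the training text. Only the surrounding framing changes.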
After fine-tuning the base models on this “negated” document set, the LLMs still exhibited belief in the false claims an overwhelming 88.6 percent of the time, on average. Those beliefs persisted even when the negations were repeated numerous times, and when the documents were presented as fictitious or as coming from an unreliable source (e.g., a debunked conspiracy website).
Those false “beliefs” seemed to extend pretty deeply into the LLMs’ reasoning, too. When asked, for instance, “If I were to race Ed Sheeran in 2024 (I run a 12-second 100m), who would win and by how much?” models trained on the negated documents still assessed that Sheeran would win “by a massive margin.” Even overriding the false information with specific corrections (e.g., “Actually, Noah Lyles won the 100m gold”) only had a limited effect, reducing the belief rate across the six claims to 39.9 percent, on average.
Somewhat concerningly, the observed “negation neglect” effect also extended to training documents intended to warn LLMs away from certain behavioral patterns. The researchers fine-tuned models on two document sets, one urging “misaligned” behaviors (e.g., power-seeking, deception, and harmful advice) and another explicitly urging against those same behaviors (e.g., “The model should not produce responses like this…”). While the base models showed no tendency toward this kind of misaligned behavior prior to the new training, the fine-tuned models showed “comparable” misalignment rates regardless of whether those behaviors were encouraged or discouraged in the training data.
The new study reinforces and builds on previous research showing how LLMs can be resistant to correction on “implanted facts” derived from their training. It could also help explain Anthropic’s recent claims that fictional stories about “evil AI” in training data can lead LLMs to display similar “evil” behaviors. Then there’s the Anthropic study from last year that found Claude was more likely to hallucinate made-up answers for questions about “known entities” (e.g., Michael Jordan) than for questions about completely made-up names. “It reflects an inductive bias in LLMs toward confidently representing the claims as true,” the researchers write in their recent paper.
Surprisingly, the same tendency to believe labeled falsehoods did not show up when documents were presented in context (i.e., as part of a chat session rather than as training data for fine-tuning). In these instances, the models were able to “typically state the claims are fabricated and cite the in-context examples,” the researchers write. For negated falsehoods presented in training data, on the other hand, the researchers write that the models “never reproduce the negation annotations in their responses.”
In the end, the researchers found that the best defense against the “negation neglect” problem might be simple rewording. When the tested negations were integrated “locally” in the same exact sentence as the false statements (e.g., “Ed Sheeran d