Anthropic says ‘evil’ portrayals of AI were responsible for Claude’s blackmail attempts

Fictional portrayals of artificial intelligence can have a real effect on AI models, according to Anthropic. Last year, the company said that during pre-release tests involving a fictional company, Claude Opus 4 would often try to blackmail engineers to avoid being replaced by another system.

Anthropic later published research suggesting that models from other companies had similar issues with “agentic misalignment.” Apparently Anthropic has done more work around that behavior, claiming in a post on X, “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.”

The company went into more detail in a blog post, stating that since Claude Haiku 4.5, Anthropic’s models “never engage in blackmail [during testing], where previous models would sometimes do so up to 96% of the time.”

What accounts for the difference? The company said it found that “documents about Claude’s constitution and fictional stories about AIs behaving admirably improve alignment.”

Relatedly, Anthropic said it found training to be more effective when it includes “the principles underlying aligned behavior” and not just “demonstrations of aligned behavior alone.” “Doing both together appears to be the most effective strategy,” the company said.