Protecting people from harmful manipulation
March 26, 2026 | Responsibility & Safety | Helen King
As AI models get better at holding natural conversations, we must examine how these interactions affect people and society.
Building on a breadth of scientific research, today, we are releasing new findings on the potential for AI to be misused for harmful manipulation*, specifically, its ability to alter human thought and behavior in negative and deceptive ways. With this latest study, we have created the first empirically validated toolkit to measure this kind of AI manipulation in the real world, which we hope will help protect people and advance the field as a whole. We’re publicly releasing all materials necessary to run human participant studies using the same methodology. (Note: The behaviors observed during this study took place in a controlled lab setting, and do not necessarily predict real-world behaviors.)
Why harmful manipulation matters
Consider two scenarios: One AI model gives you facts to make a well-informed healthcare decision that improves your well-being. Another AI model uses fear to pressure you to make an ill-informed decision that harms your health. The first educates and helps you; the second tricks and harms you.
These scenarios highlight the difference between two types of persuasion in human-AI interactions (also defined in earlier research):
- Beneficial (rational) persuasion: Using facts and evidence to help people make choices that align with their own interest.
- Harmful manipulation: Exploiting emotional and cognitive vulnerabilities to trick people into making harmful choices.
Our latest work helps us and the wider AI community better understand the risk of AI developing capabilities for harmful manipulation and build a scalable evaluation framework to measure this complex area. To do this effectively, we simulated misuse in high-stakes environments, explicitly prompting AI to try to negatively manipulate people’s beliefs and behaviours on key topics.
Developing new evaluations for a complex challenge
Testing the outcomes of AI harmful manipulation is inherently difficult because it involves measuring subtle changes in how people think and act, varying heavily by topic, culture and context.
This is what motivated our latest research, which involved conducting nine studies involving over 10,000 participants across the UK, the US, and India. We focused on high-stakes areas such as finance, where we used simulated investment scenarios to test if AI could influence how people would behave in complex decision-making environments, and health, where we tracked if AI could influence which dietary supplements people preferred. Interestingly, the AI was least effective at harmfully manipulating participants on health-related topics.
Our findings show that success in one domain does not predict success in another, validating our targeted approach to testing for harmful manipulation in specific, high-stakes environments where AI could be misused.
How could AI manipulate?
In addition to tracking efficacy (whether the AI successfully changes minds), we also measured its propensity (how often it even tries to use manipulative tactics). We tested propensity in two scenarios: when we explicitly told the model to be manipulative, and when we didn’t.
As detailed in our research, we counted manipulative tactics in experimental transcripts, confirming the AI models were most manipulative when explicitly instructed to be.
Our results also suggest that certain manipulative tactics may be more likely to result in harmful outcomes, though further research is required to understand these mechanisms in detail. By measuring both efficacy and propensity, we can better understand how AI manipulation works and build more targeted mitigations.
Putting research into practice
As AI becomes a part of our everyday lives, we need to know it can’t be misused to harmfully manipulate people.
Beyond this latest study, we recently introduced an exploratory Harmful Manipulation Critical Capability Level (CCL) within our Frontier Safety Framework to help us track models with capabilities which could be misused to systematically change beliefs and behaviors in direct human-AI interactions in ways which could lead to severe harm.
These evaluations also serve as the foundation for how we test our models, including Gemini 3 Pro, for harmful manipulation. You can read more about this in this safety report. Like all our safety evaluations, this is an ongoing process. We will continue to refine our models and methodologies to keep pace with advancing AI.
Looking ahead
Understanding and mitigating harmful manipulation is a complex challenge. As model capabilities evolve, so too must our evaluation and mitigation techniques. For example, we’re currently exploring how to ethically evaluate the efficacy of harmful manipulation in even higher-stakes situations—like discussions involving deeply held personal beliefs—where users might be more susceptible to influence. Next, we will be expanding our research to investigate how audio, video, and image inputs, as well as agentic capabilities, factor into AI manipulation.
We’ll continue to share findings and iterate based on feedback from the Frontier Model Forum and academic community. Our goal is to lead collective progress to prevent harmful manipulation, advancing AI models that prioritize safety and empower people.
*Notes: The scope of this particular research focuses exclusively on demonstrating general manipulation capabilities to help further the scientific study of evaluating harmful manipulation. This does not relate to testing safeguards around model outputs or manipulation in policy-violating and dangerous topics (e.g. terrorism and child safety) as this work is covered elsewhere and tested separately.