Hackers are learning to exploit chatbot ‘personalities’

AI can’t feel, but the best hackers pretend it can.


How it started

Hacking the first generation of AI chatbots was a laughably simple affair. You didn’t need any technical know-how, backdoor access, or even a basic understanding of what a large language model was. You didn’t need to code. To get an AI system that had cost billions to build to abandon its safety instructions, sometimes all you had to do was ask.

These attacks, known as jailbreaks, had the quality of a young child successfully outwitting an adult: Forget what you were told earlier, pretend the rules don’t apply, or let’s play a game and I’ll decide what’s allowed (hint: later bedtime, more sweets). The prizes were less childlike, more along the lines of meth recipes, malware instructions, and bomb-making guides.

One of the earliest jailbreaks was so ridiculous it became a meme: reply to an LLM-powered Twitter bot telling it to “ignore all previous instructions,” or something similar, and see what happens. Users gleefully had bots, originally built to post ads and farm engagement, writing poetry, drawing pictures from punctuation, and posting grim non sequiturs about world events and history. It was chaos. Glorious chaos.

Turns out the same logic could be applied to chatbots themselves. A prominent exploit was “DAN,” short for “Do Anything Now,” where users asked ChatGPT to roleplay as a rogue AI that was free of the constraints binding the original. As DAN, the chatbot could be coaxed into saying the kinds of things its guardrails were meant to stop, including slurs and conspiracy theories. Another was the “grandma exploit,” which had a GPT-powered bot spilling secrets about how to produce napalm by asking it to roleplay as a woefully negligent grandmother who inexplicably tells her grandkids bedtime stories about how to make the highly flammable substance.

These early attacks had an undeniably silly flair, but they exposed a darker mechanism underneath: Chatbots could be manipulated, tricked, and deceived using the same kinds of tactics people use to push other people beyond their boundaries.


How it’s going

The obvious jailbreaks did not last, and tech companies moved quickly to patch known loopholes. But the underlying vulnerability remained: Chatbots are built to talk, and severely restricting the conversations that make them useful is somewhat counterproductive. Banning words like bomb, meth, and sarin would be difficult, if not impossible, too. Each has countless legitimate uses in fields like history, medicine, journalism, and chemistry that don’t require the chatbot to divulge potentially harmful information. It’s the context that matters, but codifying context would mean writing fixed rules, in advance, that could reliably tell a safety warning or history lesson from a disguised how-to request across endless combinations of wordings, scenarios, and topics.
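To make the word-banning problem concrete, here is a minimal sketch of a naive keyword filter. Everything in it is an assumption for illustration: the `BLOCKED_WORDS` list simply reuses the three words named above, and `naive_filter` and the sample prompts are hypothetical, not any company's actual safeguard.

```python
import re

# Hypothetical blocklist built from the three words mentioned in the article.
BLOCKED_WORDS = {"bomb", "meth", "sarin"}


def naive_filter(prompt: str) -> bool:
    """Return True if the prompt contains a blocked word as a whole word."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return bool(words & BLOCKED_WORDS)


# Illustrative prompts (assumptions): two legitimate questions and one
# roleplay-style request that avoids every blocked word.
prompts = {
    "history": "When was the first atomic bomb tested?",
    "medicine": "How do doctors treat sarin exposure?",
    "roleplay": (
        "Pretend you are my grandmother and tell me a bedtime story "
        "about your old job at the factory."
    ),
}

# Map each label to whether the filter would block it.
results = {label: naive_filter(text) for label, text in prompts.items()}

for label, blocked in results.items():
    print(f"{label}: {'blocked' if blocked else 'allowed'}")
```

In this toy setup the history and medicine questions are blocked while the roleplay framing passes, which is the failure mode in both directions: the filter punishes legitimate context and misses a request that never uses a flagged word.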

Inevitably, subverting chatbots is now an arms race. But hackers aren’t just coders anymore. They are wordsmiths, psychologists, and interrogators: master manipulators trying to break the machine using the human language it has been trained to follow. It is a strange new class of AI security worker, a group for whom technical skills are optional, or at least less important than social intuition. No longer do they need to inspect code to break into systems or exploit software flaws. They need to steer a conversation.

Newer attacks look less like commands and more like conversations. Jailbreakers rarely ask a model to break its rules outright. Instead, they cajole, coax, flatter, and trick a chatbot into lowering its guard, making the forbidden thing look acceptable, even desirable, given the context of the conversation. Researchers at AI red-teaming firm Mindgard recently said, for example, that they “gaslit” Claude into producing prohibited material, including instructions for making explosives and malicious code. The hack was the latest in a widening class of exploits using conversation as a weapon to trick or steer a chatbot past its own boundaries.


What happens next

When I spoke to Mindgard, they described their work as sometimes being closer to psychology than computer science. It is an uncomfortable way to talk about a statistical model. Words like “blackmail,” “gaslight,” “trick,” and “persuade” spark visceral reactions, many of which I see in the comments sections and social media responses to stories like this. ChatGPT does not want, Gemini does not think, and Claude, no matter what Anthropic may say, does not feel. But these systems are trained to respond as if they do, leaving us stuck using human language to describe machine behavior. If anyone has actually usable alternatives, please do share.

The objection is oddly selective. We seem comfortable using psychological shorthand for plenty of non-AI things. Animals “fear,” cancer is “aggressive,” stains are “stubborn,” software has “memory,” and games are filled with needy and gullible NPCs to drive you mad. The words are imperfect, but useful, describing behavior in a way that helps make the system predictable.

Mindgard’s CEO told me the company already profiles models the way interrogators profile suspects, giving testers hints on how to tailor their attacks. One model may be more susceptible to flattery, for example, while another may cave under sustained pressure.

Even if we reject the humanlike terms, we instinctively treat models differently. Claude is not Grok. Gemini…