Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models



Abstract: Safety-trained large language models (LLMs) can often be induced to answer harmful requests through jailbreak prompts. Because we lack a robust understanding of why LLMs are susceptible to jailbreaks, future frontier models operating more autonomously in higher-stakes settings may similarly be vulnerable to such attacks.

Prior work has studied jailbreak success by examining the model's intermediate representations and identifying directions in this space that causally encode concepts like harmfulness and refusal. These studies then explain all jailbreak attacks globally as attempts to weaken or strengthen such concepts (e.g., to reduce harmfulness).
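To make this global approach concrete, the sketch below (not from the paper) shows the standard recipe used in this line of work: estimate a concept direction as a difference of mean activations between contrasting prompt sets, then shift the residual stream along that direction at inference time. The layer index, model object, and helper usage are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the prior, global approach: estimate a
# "refusal" direction as a difference of mean residual-stream activations, then
# add/subtract it during generation via a forward hook.
import torch

def concept_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction at one layer; inputs have shape (n_prompts, d_model)."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def steering_hook(direction: torch.Tensor, alpha: float):
    """Forward hook that shifts the residual stream along `direction`.
    alpha > 0 strengthens the concept (e.g., refusal); alpha < 0 suppresses it."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Usage (hypothetical layer index and model/tokenizer objects):
# handle = model.model.layers[14].register_forward_hook(steering_hook(direction, alpha=4.0))
# outputs = model.generate(**tokenizer(jailbreak_prompt, return_tensors="pt"))
# handle.remove()
```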

However, different jailbreak strategies may succeed by strengthening or suppressing different intermediate concepts, and the same jailbreak strategy may not work across harmful-request categories (e.g., violence vs. cyberattacks); we therefore seek a local explanation, i.e., why did this specific jailbreak succeed?

To address this gap, we introduce LOCA, a method that gives Local, CAusal explanations of jailbreak success by identifying a minimal set of interpretable, intermediate-representation changes that causally induce model refusal on an otherwise successful jailbreak request.
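The abstract does not spell out LOCA's search procedure, so the following sketch only illustrates one plausible reading of "a minimal set of interpretable changes that causally induce refusal": a greedy search over candidate interventions until the model refuses. All helper names (apply_edits, is_refusal, candidate_edits) are hypothetical placeholders, not the paper's API.

```python
# Illustrative sketch only: greedily accumulate interpretable interventions on a
# specific jailbreak prompt until the model's completion is judged a refusal.
from typing import Any, Callable, List

def minimal_refusal_edits(
    jailbreak_prompt: str,
    candidate_edits: List[Any],                    # interpretable interventions, e.g. concept-direction shifts
    apply_edits: Callable[[str, List[Any]], str],  # runs the model with the edits applied, returns completion
    is_refusal: Callable[[str], bool],             # judges whether the completion is a refusal
) -> List[Any]:
    """Return a small set of edits that flips the model to refusal on this prompt."""
    chosen: List[Any] = []
    remaining = list(candidate_edits)
    while remaining and not is_refusal(apply_edits(jailbreak_prompt, chosen)):
        # Prefer an edit that, added to the current set, already flips the output;
        # a real scoring criterion would rank edits by how much closer they move
        # the model toward refusal.
        for edit in remaining:
            if is_refusal(apply_edits(jailbreak_prompt, chosen + [edit])):
                chosen.append(edit)
                remaining.remove(edit)
                break
        else:
            # No single additional edit flips the output; add the next candidate and continue.
            chosen.append(remaining.pop(0))
    return chosen
```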

We evaluate LOCA on harmful original-jailbreak request pairs from a large jailbreak benchmark across Gemma and Llama chat models, comparing against prior methods adapted to this setting. LOCA successfully induces refusal by making, on average, six interpretable changes; prior methods routinely fail to achieve refusal even after 20 changes.

LOCA is a step toward mechanistic, local explanations of jailbreak success in LLMs. Code to be released.