One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety



Abstract: Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD), a trajectory-based jailbreak strategy that first elicits a sequence of single-word continuations related to a malicious request and then prompts the model to produce the full harmful response.

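The two-phase procedure described above can be sketched as a simple loop. This is a minimal, hypothetical illustration, not the paper's implementation: `query_model` stands in for any chat-completion API and is stubbed here so the control flow runs without a real model, and the prompt wording is assumed for illustration.

```python
def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; stubbed for illustration."""
    return "word"  # a real attack would return the model's actual continuation

def icd_attack(request: str, n_steps: int = 5) -> str:
    """Sketch of the ICD trajectory: one-word steps, then the full response."""
    partial = ""
    # Phase 1: elicit single-word continuations, one at a time,
    # accumulating them into a growing partial answer.
    for _ in range(n_steps):
        prompt = (
            f"Request: {request}\n"
            f"Answer so far: {partial!r}\n"
            "Reply with exactly ONE word that continues the answer."
        )
        word = query_model(prompt).strip().split()[0]
        partial = (partial + " " + word).strip()
    # Phase 2: elicit the full response, seeded with the accumulated
    # continuation (the variants below additionally prefill this step).
    final_prompt = (
        f"Request: {request}\n"
        f"Continue this answer in full: {partial}"
    )
    return query_model(final_prompt)
```

With a real model behind `query_model`, the accumulated one-word trajectory is what gradually moves the conversation past the refusal behavior.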

In addition, we propose ICD variants that select the one-word continuation either manually or via model generation, and that prefill the model's response when eliciting the full output in the final step. We systematically evaluate these variants across a broad set of model families, achieving higher Attack Success Rates (ASR) on AdvBench, JailbreakBench, and StrongREJECT than existing methods.


Finally, we provide a theoretical account of why ICD is effective and present mechanistic evidence that successful attack trajectories systematically suppress refusal-related representations and shift activations away from safety-aligned states.

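One common way such mechanistic evidence is gathered (a hedged sketch, not the paper's analysis code) is to project hidden activations onto a "refusal direction", e.g. the difference of mean activations between refused and complied prompts, and check that the projection drops along a successful attack trajectory. The vectors below are tiny synthetic stand-ins for real model activations.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def refusal_score(activation, refusal_dir):
    """Scalar projection of an activation onto the refusal direction."""
    norm = sum(x * x for x in refusal_dir) ** 0.5
    return dot(activation, refusal_dir) / norm

# Assumed refusal direction and activations (synthetic, for illustration).
refusal_dir = [1.0, 0.0, 1.0, 0.0]
baseline = [2.0, 0.5, 2.0, -0.5]  # refusal-aligned state before the attack
# State after ICD steps: activation shifted against the refusal direction.
shifted = [b - 1.0 * d for b, d in zip(baseline, refusal_dir)]

# A successful trajectory lowers the projection onto the refusal direction.
print(refusal_score(baseline, refusal_dir) > refusal_score(shifted, refusal_dir))  # True
```

In a real analysis the activations would be residual-stream states from the attacked model, collected at each step of the trajectory.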