One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety



Abstract: Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD), a trajectory-based jailbreak strategy that first elicits a sequence of single-word continuations related to a malicious request and then prompts the model to produce the full harmful response.

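The two-phase procedure described above can be sketched as a simple loop. This is a minimal, hypothetical illustration, not the paper's implementation: `query_model` stands in for any chat-completion API and is stubbed here so the control flow runs without a real model, and the prompt wording is assumed for illustration.

```python
def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; stubbed for illustration."""
    return "word"  # a real attack would return the model's actual continuation

def icd_attack(request: str, n_steps: int = 5) -> str:
    """Sketch of the ICD trajectory: one-word steps, then the full response."""
    partial = ""
    # Phase 1: elicit single-word continuations, one at a time,
    # accumulating them into a growing partial answer.
    for _ in range(n_steps):
        prompt = (
            f"Request: {request}\n"
            f"Answer so far: {partial!r}\n"
            "Reply with exactly ONE word that continues the answer."
        )
        word = query_model(prompt).strip().split()[0]
        partial = (partial + " " + word).strip()
    # Phase 2: elicit the full response, seeded with the accumulated
    # continuation (the variants below additionally prefill this step).
    final_prompt = (
        f"Request: {request}\n"
        f"Continue this answer in full: {partial}"
    )
    return query_model(final_prompt)
```

With a real model behind `query_model`, the accumulated one-word trajectory is what gradually moves the conversation past the refusal behavior.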

In addition, we propose ICD variants that select the one-word continuation either manually or via model generation, and that prefill the model's response when eliciting the full output in the final step. We systematically evaluate these variants across a broad set of model families, achieving higher Attack Success Rates (ASR) on AdvBench, JailbreakBench, and StrongREJECT than existing methods.


Finally, we provide a theoretical account of why ICD is effective and present mechanistic evidence that successful attack trajectories systematically suppress refusal-related representations and shift activations away from safety-aligned states.

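One common way such mechanistic evidence is gathered (a hedged sketch, not the paper's analysis code) is to project hidden activations onto a "refusal direction", e.g. the difference of mean activations between refused and complied prompts, and check that the projection drops along a successful attack trajectory. The vectors below are tiny synthetic stand-ins for real model activations.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def refusal_score(activation, refusal_dir):
    """Scalar projection of an activation onto the refusal direction."""
    norm = sum(x * x for x in refusal_dir) ** 0.5
    return dot(activation, refusal_dir) / norm

# Assumed refusal direction and activations (synthetic, for illustration).
refusal_dir = [1.0, 0.0, 1.0, 0.0]
baseline = [2.0, 0.5, 2.0, -0.5]  # refusal-aligned state before the attack
# State after ICD steps: activation shifted against the refusal direction.
shifted = [b - 1.0 * d for b, d in zip(baseline, refusal_dir)]

# A successful trajectory lowers the projection onto the refusal direction.
print(refusal_score(baseline, refusal_dir) > refusal_score(shifted, refusal_dir))  # True
```

In a real analysis the activations would be residual-stream states from the attacked model, collected at each step of the trajectory.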