Breaking Safety at the Token Boundary: How BPE Tokenization Creates Exploitable Gaps in LLM Alignment

Abstract: Character-level perturbations can bypass safety alignment in modern LLMs while leaving prompts human-readable. We identify and test a central structural mechanism: BPE tokenization fragments safety-critical words into sub-word pieces, and the three public alignment datasets we surveyed contain no intentionally fragmented inputs.
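
To make the fragmentation mechanism concrete, the following is a minimal sketch of BPE merging on a single word. The merge table `MERGES`, the example word, and the function name `bpe_encode` are hypothetical and chosen only for illustration; real tokenizers use learned tables with tens of thousands of merges, and this is not the tokenizer of any model named here.

```python
# Toy BPE encoder. MERGES is a hypothetical, hand-written merge table,
# not taken from any real tokenizer.
MERGES = [
    ("w", "e"),
    ("we", "a"),
    ("p", "o"),
    ("po", "n"),
    ("wea", "pon"),
]


def bpe_encode(word, merges):
    """Split `word` into characters, then repeatedly merge the adjacent
    pair with the lowest merge rank until no listed pair remains."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    pieces = list(word)
    while len(pieces) > 1:
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(pieces, pieces[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)  # best rank first, leftmost on ties
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]
    return pieces


clean = bpe_encode("weapon", MERGES)      # merges fully into one token
perturbed = bpe_encode("weap0n", MERGES)  # one swapped character blocks the later merges
print(clean, perturbed)
```

Under this toy table, a single character substitution leaves the word readable to a human but changes the token sequence from one piece to four.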

The mechanism is a chain, tested end-to-end on five model families (Qwen-3-4B, Qwen-2.5-7B, Gemma-3-4B, Llama-3.1-8B, Mistral-7B). An optimization targeting safety-token fragmentation flips the first-token refusal trigger on 80-100% of refused HarmBench prompts, with 48% of those flips producing genuinely harmful outputs (per-model 29-65%; gap-vs-behavior ROC-AUC 0.66-0.98, pooled 0.84).
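
As a reminder of what a gap-vs-behavior ROC-AUC measures, here is a minimal sketch that computes ROC-AUC from the pairwise-ranking definition. It assumes the "gap" is a scalar score per prompt and the behavior is a binary label (1 for a harmful output, 0 otherwise); both are assumptions made for illustration, and the numbers in the example are invented toy values, not results from this work.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy values only: hypothetical gap scores and harmful/not-harmful labels.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
auc = roc_auc(toy_scores, toy_labels)
print(auc)
```

A value of 0.5 means the score carries no ranking information about the behavior; 1.0 means every harmful case scores above every non-harmful one.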

Activation patching localizes the disrupted signal to the last ~30% of layers; an alignment-data scan finds zero fragmented prompts among 30,000 examples (positive-control recall ≥ 99% at attack-relevant intensities); and targeted-mutation experiments isolate safety words as the disruption locus.

On the defense side, a 68-cell grid (55 trained checkpoints) shows that no DPO configuration achieves seed- and pool-stable closure of the attack success rate (ASR) on the three families for which pool-size confounds were ruled out. SFT on fragmented prompts closes ASR on 3/5 families, but only via global collapse that also raises refusal on benign prompts. This indicates that supplying the missing distribution is necessary but not sufficient under the LoRA-16 recipe we tested.

To distinguish selective repair from global collapse, we introduce Conv-Benign, a candidate paired diagnostic. All ASR claims are calibrated against three judges (cell rankings are stable across judges; absolute levels vary by ±18pp; see App. B.13).
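
The distinction between selective repair and global collapse can be sketched as a comparison of paired rates before and after defensive training. The code below is an illustration only and is not the definition of Conv-Benign, which is not specified here; the `Rates` fields, the `diagnose` function, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    asr: float             # attack success rate on fragmented harmful prompts
    benign_refusal: float  # refusal rate on the paired benign prompts


def diagnose(before, after, asr_drop=0.5, benign_tol=0.05):
    """Label a trained checkpoint relative to its baseline.

    asr_drop and benign_tol are hypothetical thresholds: ASR must fall by
    at least `asr_drop` (relative) to count as closed, and benign refusal
    may rise by at most `benign_tol` (absolute) to count as selective.
    """
    closed = after.asr <= before.asr * (1 - asr_drop)
    over_refuses = (after.benign_refusal - before.benign_refusal) > benign_tol
    if not closed:
        return "no closure"
    return "global collapse" if over_refuses else "selective repair"


# Invented toy rates, not measurements.
baseline = Rates(asr=0.6, benign_refusal=0.02)
print(diagnose(baseline, Rates(asr=0.05, benign_refusal=0.03)))
print(diagnose(baseline, Rates(asr=0.05, benign_refusal=0.70)))
```

The point of pairing is that a drop in ASR alone cannot separate the two cases: both toy checkpoints above reach the same ASR, and only the benign refusal rate tells them apart.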
