Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks
Abstract: Statistical watermarking is a common approach for verifying whether text was written by a language model. Most existing schemes assume autoregressive generation, where tokens are produced left to right and contextual hashing is well defined. Diffusion language models generate text by denoising tokens in arbitrary order, so these schemes cannot be applied directly.
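To make the autoregressive setting concrete, here is a minimal sketch of a green-list watermark detector in the style of Kirchenbauer et al.: the previous token is hashed with a secret key to pseudo-randomly mark a fraction `gamma` of continuations "green", and detection is a one-proportion z-test on the green count. The hash construction, `gamma` value, and test statistic are generic illustrative choices, not the diffusion-model scheme studied in this paper.

```python
import hashlib

def is_green(prev_token: int, token: int, key: bytes, gamma: float = 0.5) -> bool:
    """Pseudo-randomly classify `token` as green given its left context.

    Hashing (key, previous token, token) marks a fraction `gamma` of the
    vocabulary green in each context; this is why the scheme needs a
    well-defined left-to-right context, which diffusion models lack.
    """
    h = hashlib.sha256(
        key + prev_token.to_bytes(4, "big") + token.to_bytes(4, "big")
    ).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < gamma

def z_score(tokens: list[int], key: bytes, gamma: float = 0.5) -> float:
    """One-proportion z-test: watermarked text is biased toward green tokens,
    so its green count exceeds the gamma*n expected under the null."""
    greens = sum(is_green(p, t, key) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (greens - gamma * n) / (gamma * (1 - gamma) * n) ** 0.5
```

Unwatermarked text scores near zero under this test, while a generator that boosts green tokens at sampling time pushes the z-score well past any standard significance threshold.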
A recent watermark by Gloaguen et al. addresses this gap for LLaDA 8B Instruct and reports a true-positive detection rate above 99%. This paper studies what happens when watermarked text is rewritten not once but several times. Using the same watermark configuration, 1,605 watermarked completions of about 300 tokens each are produced across five WaterBench domains.
Each completion is rewritten by four open-weight language models, ranging from 1.5B to 8B parameters, none of which know the watermark key. Five rewrite styles are tested: paraphrase, humanize, simplify, academic, and summarize-expand. Each style is chained for up to five hops, producing 160,500 rewritten texts in total.
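The attack protocol described above can be sketched as a simple loop; the `rewrite` function below is a hypothetical placeholder standing in for a prompted call to one of the open-weight rewriter models, which never sees the watermark key.

```python
# Styles and hop count as reported in the paper.
STYLES = ["paraphrase", "humanize", "simplify", "academic", "summarize-expand"]
MAX_HOPS = 5

def rewrite(text: str, style: str) -> str:
    """Placeholder for a key-blind rewriter model (e.g. a 1.5B-8B instruct
    model prompted with the given style). Illustrative only."""
    return f"[{style}] " + text

def chain(text: str, style: str, hops: int = MAX_HOPS) -> list[str]:
    """Apply the same rewrite style repeatedly, feeding each hop's output
    into the next, and keep every intermediate rewrite for detection."""
    out = []
    for _ in range(hops):
        text = rewrite(text, style)
        out.append(text)
    return out

# Scale check: 1,605 completions x 4 rewriters x 5 styles x 5 hops
# = 160,500 rewritten texts, matching the total reported above.
```

The key property of the chain is that each hop rewrites the previous hop's output rather than the original watermarked text, so watermark signal that survives one rewrite is attacked again at the next hop.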
The watermark is detected on 87.9% of the original outputs at the standard significance threshold. After a single rewrite, detection falls to between 14% and 41%, depending on the rewriter and style. After three rewrites, the detector score has already dropped 86% of the way from its watermarked baseline toward the null distribution. After five chained rewrites, detection falls to 4.86%, meaning that 94.76% of the originally detected texts are no longer flagged.
Repeated rewriting is therefore a much stronger attack than a single rewrite, and the result holds across all four rewriters tested.