SLAM: Structural Linguistic Activation Marking for Language Models

LLM watermarks must be detectable without compromising text quality, yet most existing schemes bias the next-token distribution and pay for detection with measurable quality loss.

We present SLAM (Structural Linguistic Activation Marking), a novel white-box watermarking scheme that sidesteps this cost by writing the mark into structural geometry rather than token frequencies: sparse autoencoders identify residual-stream directions encoding linguistic structure (e.g., voice, tense, clause order), and we causally steer those directions at generation time, leaving lexical sampling and semantics unconstrained.
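
To make the mechanism concrete, the following is a minimal sketch, not the authors' implementation, of steering a residual-stream direction during generation via a forward hook. The layer index, steering strength, and random placeholder direction are illustrative assumptions; in SLAM the direction would come from a trained sparse autoencoder feature encoding a structural property such as voice or tense.

```python
# Minimal sketch of SAE-direction steering (illustrative, not the
# authors' code). Assumes a HuggingFace-style causal LM whose decoder
# layers expose forward hooks on the residual stream. LAYER_IDX and
# ALPHA are hypothetical, and the random direction stands in for a
# real SAE feature direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-2b"   # one of the models evaluated in the paper
LAYER_IDX = 12                # hypothetical steering layer
ALPHA = 4.0                   # hypothetical steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

# Placeholder: real use would load a unit-norm SAE decoder direction
# for a structural feature (e.g., passive voice).
d = torch.randn(model.config.hidden_size, dtype=model.dtype)
d = d / d.norm()

def steer(module, inputs, output):
    # Decoder layers return a tuple whose first element is the residual
    # stream, shape (batch, seq, hidden); add the scaled mark direction.
    h = output[0] if isinstance(output, tuple) else output
    h = h + ALPHA * d.to(h.device)
    return (h, *output[1:]) if isinstance(output, tuple) else h

handle = model.model.layers[LAYER_IDX].register_forward_hook(steer)
try:
    ids = tok("The committee reviewed the proposal", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=True)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook after generation
```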

On Gemma-2 2B and 9B, SLAM achieves 100% detection accuracy at a quality cost of only 1-2 reward points (versus 7.5-11.5 for KGW, EWD, and Unigram), while preserving naturalness and diversity at near-unwatermarked levels on both models.
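
The abstract does not describe the detector itself; the sketch below is our own hedged guess at what a white-box test could look like: project the residual stream of candidate text onto the marked direction and compare the mean alignment against a threshold calibrated on unwatermarked text.

```python
# Hedged sketch of a possible white-box detection score, not the
# paper's exact detector: capture the residual stream at the steered
# layer, project each token's activation onto the mark direction d
# (unit norm), and return the mean cosine alignment.
import torch

@torch.no_grad()
def slam_score(model, tok, text, d, layer_idx):
    acts = {}
    def grab(module, inputs, output):
        acts["h"] = output[0] if isinstance(output, tuple) else output
    handle = model.model.layers[layer_idx].register_forward_hook(grab)
    try:
        ids = tok(text, return_tensors="pt").to(model.device)
        model(**ids)
    finally:
        handle.remove()
    h = acts["h"].squeeze(0).float()          # (seq, hidden)
    proj = (h @ d.float()) / h.norm(dim=-1)   # per-token alignment
    return proj.mean().item()

# A decision threshold would be calibrated on unwatermarked text, e.g.
# flag text as watermarked when slam_score(...) exceeds it.
```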

The trade-off is a complementary robustness profile: SLAM resists word-level edits but is vulnerable to paraphrasing that restructures syntax (a transformation that itself degrades quality), the converse of token-distribution methods.
