Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

Abstract: Finetuning can significantly modify the behavior of large language models, including introducing harmful or unsafe behaviors. To study these risks, researchers develop model organisms: models finetuned to exhibit specific known behaviors for controlled experimentation. Identifying these behaviors without prior knowledge of them remains challenging.

We show that a simple perplexity-based method can surface finetuning objectives from model organisms by leveraging their tendency to overgeneralize their finetuned behaviors beyond the intended context. First, we generate diverse completions from the finetuned model using short random prefills drawn from general corpora. Second, we rank completions by decreasing perplexity gap: perplexity under the reference model minus perplexity under the finetuned model. The top-ranked completions often reveal the finetuning objectives, without requiring model internals or prior assumptions about the behavior.
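The two steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two models are abstracted as functions mapping a text to its per-token log-probabilities, and the helper names (`sample_prefills`, `rank_by_perplexity_gap`) are our own.

```python
import math
import random

def sample_prefills(corpus, num_prefills, prefix_words=4, seed=0):
    """Step 1 helper: draw short random prefixes from a general corpus.
    Each prefix is later used to prefill the finetuned model and sample
    a diverse completion."""
    rng = random.Random(seed)
    docs = rng.sample(corpus, min(num_prefills, len(corpus)))
    return [" ".join(doc.split()[:prefix_words]) for doc in docs]

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities. Normalizing by
    token count keeps scores comparable across sequence lengths."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def rank_by_perplexity_gap(completions, ref_logprobs, ft_logprobs):
    """Step 2: sort completions by decreasing gap
    (perplexity under reference) - (perplexity under finetuned),
    so text the finetuned model finds far more likely than the
    reference model rises to the top."""
    scored = [
        (perplexity(ref_logprobs(c)) - perplexity(ft_logprobs(c)), c)
        for c in completions
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In a real run, `ref_logprobs` and `ft_logprobs` would query the reference and finetuned models; a completion expressing the finetuning objective (e.g. a backdoor trigger phrase) is much more likely under the finetuned model than under the reference, giving it a large positive gap.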

We evaluate this on a diverse set of model organisms (N=76, 0.5B to 70B parameters), including backdoored models, models finetuned to internalize false facts via synthetic document finetuning, adversarially trained models with hidden concerning behaviors, and models exhibiting emergent misalignment. For the vast majority of model organisms tested, the method surfaces completions revealing finetuning objectives within the top-ranked results, with models trained via synthetic document finetuning or to produce exact phrases being particularly susceptible.

We further show that the technique can be effective even without access to the exact pre-finetuning checkpoint: trusted reference models from different families can serve as effective substitutes. As the method requires only next-token probabilities from the finetuned model, it is compatible with API-gated models that expose token logprobs.
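Because the score depends only on per-token log-probabilities of a fixed text, it can be computed directly from logprob lists returned by an API. A hedged sketch (the lists below stand in for actual API responses; the function name is ours):

```python
import math

def gap_score(ref_token_logprobs, ft_token_logprobs):
    """Perplexity gap from raw per-token logprob lists. The two lists
    may have different lengths: a cross-family reference model will
    usually tokenize the same text into a different number of tokens,
    and per-token normalization absorbs that difference."""
    ppl = lambda lps: math.exp(-sum(lps) / len(lps))
    return ppl(ref_token_logprobs) - ppl(ft_token_logprobs)
```

A large positive score means the finetuned model assigns the text much higher likelihood than the trusted reference, flagging it as a candidate expression of the finetuning objective.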
