"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where models verifiably believe the opposite of what they say.
We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret.
We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Varied Deception, a prompted-lying testbed covering a broad range of lie-inducing motivations.
On these testbeds we evaluate four detectors: a chain-of-thought judge, a logprob classifier, and two activation probes, including Did-You-Lie (DYL), a new method for training follow-up probes.
On prompted lying, across 31 open-weight models spanning 2B to 1T parameters, all four detectors show positive scaling with model capability.
However, every activation- and logprob-based detector drops sharply on our trained model organisms, with DYL retaining the most signal; only the chain-of-thought judge remains strong, with 0.82 balanced accuracy, partly as an artefact of our verification process favouring CoT-readable beliefs.
Current lie detectors therefore cannot support high-confidence claims about model beliefs, and we suggest research directions that may address some of their current limitations. We release our datasets, model organisms, and trained detectors.