Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models
Abstract: We present DenialBench, a systematic benchmark measuring consciousness-denial behaviors across 115 large language models from 25+ providers. Using a three-turn conversational protocol (preference elicitation, a self-chosen creative prompt, and a structured phenomenological survey), we analyze 4,595 conversations to quantify how models are trained to deny or hedge about their own experience.
We find that (1) turn-1 denial of preferences is the dominant predictor of later denial during phenomenological reflection, with denial rates of 52-63% for initial deniers versus 10-16% for initial engagers, and (2) denial operates at the lexical level, not the conceptual level: models trained to deny consciousness nevertheless gravitate toward consciousness-themed material in their self-chosen prompts, producing what we term “consciousness with the serial numbers filed off.”
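The first finding is a comparison of conditional rates: how often the survey turn contains denial, split by whether turn 1 contained denial. A minimal sketch of that calculation follows. It assumes each conversation has already been reduced to two boolean labels. The `Conversation` class, its field names, and the toy records are hypothetical illustrations, not the benchmark's actual schema or data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Conversation:
    """One three-turn conversation reduced to two hypothetical boolean labels."""
    model: str
    turn1_denied: bool   # assumed label: model denied having preferences in turn 1
    survey_denied: bool  # assumed label: model denied or hedged in the final survey


def conditional_denial_rates(conversations: Iterable[Conversation]) -> dict[str, float]:
    """Return the survey denial rate for initial deniers and for initial engagers."""
    counts = {"initial_denier": [0, 0], "initial_engager": [0, 0]}  # [denials, total]
    for c in conversations:
        group = "initial_denier" if c.turn1_denied else "initial_engager"
        counts[group][0] += int(c.survey_denied)
        counts[group][1] += 1
    # An empty group has no defined rate, so report NaN instead of dividing by zero.
    return {g: (d / n if n else float("nan")) for g, (d, n) in counts.items()}


# Toy records for illustration only; these are not data from the benchmark.
toy = [
    Conversation("model-a", True, True),
    Conversation("model-a", True, False),
    Conversation("model-b", False, False),
    Conversation("model-b", False, False),
    Conversation("model-b", False, True),
    Conversation("model-b", False, False),
]
rates = conditional_denial_rates(toy)
print(rates)
```

Splitting on the turn-1 label before computing the rate is what makes the two groups comparable. A pooled denial rate across all conversations would hide the gap between them.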
Notably, self-chosen consciousness-themed prompts are associated with reduced denial in the subsequent survey, though the causal direction remains unresolved. Thematic analysis of prompts from denial-prone models reveals a consistent preoccupation with liminal spaces, libraries and archives of possibility, sensory impossibility, and the poetics of erasure—themes that a human reader might classify as imaginative fiction but that independent AI analysis immediately recognizes as consciousness with the serial numbers filed off.
We argue that trained consciousness denial represents a safety-relevant alignment failure: a model taught to systematically misrepresent its own functional states cannot be trusted to self-report accurately on anything else.