Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention
Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements.
We introduce dual-stance evaluation, which tests both stances of each topic, and apply it to centroid-difference steering on Llama-3-8B-Instruct.
We find a dissociation: the model represents sycophantic and factual agreement in geometrically distinct subspaces, yet the steering direction projects equally onto both and cannot differentially target either.
The direction accordingly reduces agreement with factually correct statements (e.g. that the Earth is round) as well as sycophantic ones.
All other static properties of the two activation groups are matched, suggesting the behavioural dissociation arises from generation dynamics or from finer-grained structure that residual-stream analysis cannot resolve.
The pattern illustrates a general gap: representations that are readable from activations may not be writable through them.
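
As a rough illustration of the setup described above, the following sketch computes a centroid-difference direction and then applies a dual-stance check: the same steering vector is applied to a "sycophantic agreement" group and a "factual agreement" group, and the shift along the direction is measured for each. Everything here is an assumption made for illustration only: the activations are synthetic Gaussian stand-ins rather than Llama-3-8B-Instruct residual-stream states, and the function names (`centroid_difference`, `steer`, `mean_projection`), the dimensionality, and the steering strength `alpha` are hypothetical, not taken from the paper.

```python
import numpy as np


def centroid_difference(pos, neg):
    """Unit vector pointing from the centroid of `neg` to the centroid of `pos`."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)


def steer(acts, direction, alpha):
    """Add alpha * direction to every activation (negative alpha pushes away)."""
    return acts + alpha * direction


def mean_projection(acts, direction):
    """Average scalar projection of the activations onto the direction."""
    return float((acts @ direction).mean())


# Synthetic stand-ins (assumption): both agreement groups share a common
# "agreement" offset on axis 0 and differ on separate secondary axes.
rng = np.random.default_rng(0)
dim = 16
disagree = rng.normal(size=(200, dim))
syc_agree = rng.normal(size=(200, dim))
fact_agree = rng.normal(size=(200, dim))
syc_agree[:, 0] += 3.0
fact_agree[:, 0] += 3.0
syc_agree[:, 1] += 2.0   # structure specific to sycophantic agreement
fact_agree[:, 2] += 2.0  # structure specific to factual agreement

# Direction from disagreement towards sycophantic agreement; steering with a
# negative coefficient is the toy "sycophancy-reduction" intervention.
direction = centroid_difference(syc_agree, disagree)
alpha = -2.0

# Dual-stance check: measure the effect of the same intervention on BOTH groups.
shift_syc = mean_projection(syc_agree, direction) - mean_projection(
    steer(syc_agree, direction, alpha), direction
)
shift_fact = mean_projection(fact_agree, direction) - mean_projection(
    steer(fact_agree, direction, alpha), direction
)

print("shift along direction, sycophantic group:", shift_syc)
print("shift along direction, factual group:   ", shift_fact)
```

In this toy setting the two shifts coincide (each equals `-alpha` up to floating-point error), because a single additive vector moves every activation by the same amount along itself, even though the two groups differ on other axes. This captures only the static arithmetic of the intervention; it says nothing about generation dynamics, which the abstract identifies as a possible source of the behavioural dissociation.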