Detecting and Controlling Sycophancy with Cascading Linear Features
Abstract: Interpreting and controlling model behaviors through activation steering methods requires many pairs of contrastive samples that clearly exhibit desired or undesired behavior. These data pairs determine the degree to which interpretability frameworks can reliably detect model features responsible for a behavior, and therefore the ability to steer models toward or away from such behavior.
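For readers unfamiliar with contrastive-pair steering, the following is a minimal sketch of one common baseline, a difference-of-means steering vector. It is not necessarily the method used in this work: the toy two-dimensional activations, the function names `mean_vector`, `steering_vector`, and `steer`, and the strength parameter `alpha` are all illustrative assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length activation vectors."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(row[j] for row in rows) / n for j in range(dim)]


def steering_vector(positive: Sequence[Sequence[float]],
                    negative: Sequence[Sequence[float]]) -> Vector:
    """Difference of class means: points from 'negative' toward 'positive'."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(activation: Sequence[float], direction: Sequence[float],
          alpha: float) -> Vector:
    """Shift an activation along the direction; a negative alpha steers away."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy (hypothetical) activations for samples that do / do not show the behavior.
sycophantic = [[2.0, 0.0], [4.0, 2.0]]
neutral = [[0.0, 0.0], [2.0, 2.0]]

direction = steering_vector(sycophantic, neutral)
steered = steer([1.0, 1.0], direction, alpha=-1.0)  # steer away from the behavior
```

The quality of `direction` depends entirely on how cleanly the two sample sets differ in the target behavior and nothing else, which is the dependence on contrastive data described above.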
In this work, we present an iterative data generation pipeline that isolates cascading linear features responsible for a behavior. Specifically, we show that moving beyond simple binary sample pairs, and instead isolating samples in which the degree of a feature scales linearly with the behavior, allows for better disentanglement of features.
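To make the idea of graded (rather than binary) samples concrete, the sketch below checks whether projections onto a candidate direction scale linearly with a behavior level. It is an illustration under stated assumptions, not the pipeline presented here: the four behavior levels, the toy activations, and the choice of the highest-minus-lowest level as the candidate direction are all hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical graded samples: behavior level 0 (none) to 3 (strong),
# with one toy activation vector per level.
levels = [0, 1, 2, 3]
activations = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]

# Candidate direction (an assumption): strongest level minus weakest level.
raw = [hi - lo for hi, lo in zip(activations[-1], activations[0])]
norm = sqrt(dot(raw, raw))
unit: List[float] = [component / norm for component in raw]

# Projection onto the unit direction gives a deterministic scalar score.
scores = [dot(a, unit) for a in activations]

# A correlation near 1 suggests the feature scales linearly with the level.
r = pearson(levels, scores)
```

A binary pair can only say which side of a boundary a sample falls on; graded levels additionally let one test whether the projected score tracks the degree of the behavior.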
We focus on detecting and steering away from sycophancy, the tendency of language models to prioritize user validation. We demonstrate that sycophancy features discovered through cascading samples form linearly separable subspaces and allow for the selection of model activations that correspond more clearly to the desired behavior than those selected by baseline approaches.
We also evaluate the ability of these features to enable detection, deterministic scoring, and robust steering, and find that they match or outperform LLM-as-a-judge and system-prompting baselines while requiring less computation and providing more interpretability guarantees.