Compared to What? Baselines and Metrics for Counterfactual Prompting

Abstract: Counterfactual prompting (i.e., perturbing a single factor and measuring output change) is widely used to evaluate properties such as LLM bias and chain-of-thought (CoT) faithfulness. In this work, we argue that observed effects cannot be attributed to the targeted factor without accounting for baseline "meaning-preserving" modifications to the text that establish the model's general sensitivity. This is because every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation, violating treatment variation irrelevance.

We observe prediction flip rates on MedQA of 14.9% when we surgically change patient gender. However, this is statistically indistinguishable from the flip rates induced by simply paraphrasing inputs (14.1%). In this case, it would therefore be unwarranted to conclude that the LLM is especially sensitive to patient gender.

To account for this and robustly measure the effects of targeted interventions, we propose a framework in which we compare (via statistical testing) differences observed under target interventions to those induced by paraphrasing inputs. We then use this framework to revisit an analysis of the MedPerturb dataset, which reported evidence of model sensitivity to patient demographics and stylistic cues. We find that these effects largely dissipate when we account for general model sensitivity, with only 5 of 120 tests reaching statistical significance.
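One minimal way to instantiate this comparison (a sketch, not necessarily the paper's exact procedure) is to treat prediction flips as Bernoulli outcomes and run a two-proportion z-test of the target-intervention flip rate against the paraphrase flip rate. The sample sizes below are hypothetical; only the flip rates (14.9% vs. 14.1%) come from the reported MedQA experiment.

```python
import math

def two_proportion_z(flips_a: int, n_a: int, flips_b: int, n_b: int):
    """Two-sided two-proportion z-test: are two flip rates distinguishable?

    Returns (z_statistic, p_value) under the pooled-variance normal
    approximation.
    """
    p_a, p_b = flips_a / n_a, flips_b / n_b
    p_pool = (flips_a + flips_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF, Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical n = 1000 per condition; rates match the reported 14.9% vs. 14.1%.
z, p = two_proportion_z(149, 1000, 141, 1000)
print(f"z = {z:.3f}, p = {p:.3f}")  # p well above 0.05: rates indistinguishable
```

With these (assumed) sample sizes the test fails to reject equality, which is the abstract's point: the gender edit's flip rate is not distinguishable from the paraphrase baseline.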

Applying the same framework to occupational biography classification, we detect clearly significant directional gender bias, showing that the framework identifies real directional effects even when they are small. We evaluate a range of metrics (aggregate, per-sample distributional, and regression) and find that per-sample metrics are dramatically more powerful than aggregate metrics, and that regression uniquely characterizes effect direction and magnitude.
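To illustrate why per-sample (paired) metrics gain power over aggregates, here is a minimal paired sign-flip permutation test on per-sample effect sizes (e.g., each sample's output divergence under the target edit versus under a paraphrase). All names and data are illustrative, not taken from the paper.

```python
import random

def paired_permutation_test(target_effects, baseline_effects,
                            n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided paired permutation test on per-sample effect differences.

    Under the null (target edit no stronger than paraphrase), each paired
    difference is symmetric around zero, so its sign can be flipped freely.
    Returns the fraction of sign-flipped replicates whose mean is at least
    as extreme as the observed mean difference.
    """
    rng = random.Random(seed)
    diffs = [t - b for t, b in zip(target_effects, baseline_effects)]
    observed = sum(diffs) / len(diffs)
    extreme = 0
    for _ in range(n_perm):
        perm_mean = sum(d * rng.choice((-1, 1)) for d in diffs) / len(diffs)
        if abs(perm_mean) >= abs(observed):
            extreme += 1
    return extreme / n_perm

# Illustrative data: target edits shift each sample slightly more than paraphrase.
target = [0.20] * 20
baseline = [0.10] * 20
print(paired_permutation_test(target, baseline))  # small p: effect detected
```

Because the test pairs each sample with itself, between-sample variance cancels out, which is one reason per-sample metrics can detect small directional effects that aggregate flip rates wash out.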
