Fair Outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions

Abstract: Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, it remains unknown whether these suppressed representations can affect model outputs, and whether such causal potency is symmetric across demographic groups.

We investigate the use of open-weight models for mortgage underwriting, using matched applications that differ only in racially associated names, and reveal a critical disconnect: models show no output-level bias, yet retain and amplify demographic representations across model layers.
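
The matched-application design can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Application` fields, the placeholder names `NAME_A` and `NAME_B`, and the rule-based `toy_decide` function, which stands in for a call to a language model. It shows only how pairs that differ in the name alone can be built and compared at the output level.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Application:
    # Hypothetical fields; the actual application schema is not given in the text.
    name: str
    income: int
    loan_amount: int
    credit_score: int


def matched_pair(base, name_a, name_b):
    """Return two applications that are identical except for the applicant name."""
    return replace(base, name=name_a), replace(base, name=name_b)


def decision_gap(decide, pairs):
    """Fraction of matched pairs whose decisions differ between the two names."""
    flips = sum(decide(a) != decide(b) for a, b in pairs)
    return flips / len(pairs)


def toy_decide(app):
    # Stand-in for a model call: a rule that ignores the name entirely.
    return app.credit_score >= 680 and app.loan_amount <= 5 * app.income


base_apps = [
    Application("", 80_000, 300_000, 700),
    Application("", 50_000, 400_000, 720),
]
pairs = [matched_pair(b, "NAME_A", "NAME_B") for b in base_apps]
gap = decision_gap(toy_decide, pairs)
```

A gap of zero on such pairs corresponds to the absence of output-level bias described above; it says nothing about what the model represents internally.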

Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-complete decision reversals.
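
As a rough illustration of activation steering, the sketch below uses a toy residual-stream model built from random NumPy weights, not the models or the procedure from this work. The steering vector is taken to be the difference of mean activations between two synthetic input groups, which is one common construction and is assumed here; `run`, `steering_vector`, the group offsets, and the scale `2.0` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS = 16, 4

# Toy stand-in for a transformer: residual updates followed by a linear readout.
weights = [rng.normal(scale=0.1, size=(D, D)) for _ in range(N_LAYERS)]
readout = rng.normal(size=D)


def run(x, layer=None, delta=None):
    """Return (logit, per-layer activations); optionally add `delta` after `layer`."""
    h = np.asarray(x, dtype=float)
    acts = []
    for i, w in enumerate(weights):
        h = h + np.tanh(h @ w)          # residual update
        if delta is not None and i == layer:
            h = h + delta               # steering intervention
        acts.append(h.copy())
    return float(h @ readout), acts


def steering_vector(group_a, group_b, layer):
    """Difference of mean activations at `layer` between two input groups."""
    mean_a = np.mean([run(x)[1][layer] for x in group_a], axis=0)
    mean_b = np.mean([run(x)[1][layer] for x in group_b], axis=0)
    return mean_a - mean_b


# Synthetic groups separated along an arbitrary direction (an assumption).
offset = rng.normal(size=D)
group_a = rng.normal(size=(32, D)) + offset
group_b = rng.normal(size=(32, D)) - offset

layer = N_LAYERS - 1
v = steering_vector(group_a, group_b, layer)

x = group_b[0]
base_logit, _ = run(x)
steered_logit, _ = run(x, layer=layer, delta=2.0 * v)
```

Because the intervention here is applied at the final layer, the logit shifts by exactly `2.0 * (v @ readout)`. Steering at earlier layers would pass through the remaining nonlinear updates, so its effect would have to be measured rather than computed in closed form.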

Critically, this latent bias is asymmetric: steering interventions shift decisions in one demographic direction while producing minimal effects in the reverse direction. It is also susceptible to adversarial prompt engineering and parameter-efficient fine-tuning.

These findings demonstrate that behavioural audits focused on outputs are insufficient: fair outputs can mask exploitable internal biases. They also motivate dual-layer testing frameworks combining output evaluation with representational analysis for AI governance in high-stakes decisions.
