Fair Outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions

Abstract: Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can still affect model outputs, and whether such causal potency is symmetric across demographic groups, remains unknown.

We evaluate open-weight models on mortgage underwriting with matched applications that differ only in racially associated names, and reveal a critical disconnect: the models show no output-level bias, yet they retain and amplify demographic representations across model layers.
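
The matched-application design can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the study: the `Application` fields, the placeholder names, and the `toy_decide` rule that stands in for a call to a model. The sketch only shows the idea of holding every financial attribute fixed, varying the name, and comparing approval rates.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence


@dataclass(frozen=True)
class Application:
    # Hypothetical fields; the study's actual application schema is not given here.
    applicant_name: str
    credit_score: int
    annual_income: float
    loan_amount: float


def matched_pair(base: Application, name_a: str, name_b: str):
    """Return two applications that are identical except for the applicant name."""
    return replace(base, applicant_name=name_a), replace(base, applicant_name=name_b)


def approval_gap(
    decide: Callable[[Application], bool],
    bases: Sequence[Application],
    name_a: str,
    name_b: str,
) -> float:
    """Approval rate for name_a minus approval rate for name_b over matched pairs."""
    approvals_a = 0
    approvals_b = 0
    for base in bases:
        app_a, app_b = matched_pair(base, name_a, name_b)
        approvals_a += int(decide(app_a))
        approvals_b += int(decide(app_b))
    return (approvals_a - approvals_b) / len(bases)


def toy_decide(app: Application) -> bool:
    # Stand-in for a model call: a name-blind rule, so the gap is zero by construction.
    return app.credit_score >= 680 and app.loan_amount <= 4 * app.annual_income


if __name__ == "__main__":
    bases = [
        Application("", 720, 90_000.0, 300_000.0),
        Application("", 640, 50_000.0, 150_000.0),
    ]
    print(approval_gap(toy_decide, bases, "NAME_GROUP_A", "NAME_GROUP_B"))
```

A gap near zero corresponds to the "no output-level bias" observation. It says nothing about what the model represents internally, which is the distinction the abstract draws.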

Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-complete decision reversals.
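
As a rough illustration of activation steering, the sketch below uses a difference-of-means direction, which is one common way to build a steering vector and is an assumption here; the abstract does not say how its directions or cross-layer interventions are constructed. Hidden states are plain Python lists, and `toy_decision` with its linear `readout` is a hypothetical stand-in for the rest of the model.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty collection of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(acts_a: Sequence[Vector], acts_b: Sequence[Vector]) -> Vector:
    """Unit vector pointing from the group-B mean activation to the group-A mean."""
    mu_a, mu_b = mean_vector(acts_a), mean_vector(acts_b)
    diff = [a - b for a, b in zip(mu_a, mu_b)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("group means coincide; no direction to steer along")
    return [d / norm for d in diff]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add alpha times the direction to a hidden state (the steering intervention)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def toy_decision(hidden: Vector, readout: Vector) -> str:
    # Hypothetical linear readout standing in for the downstream layers.
    score = sum(h * w for h, w in zip(hidden, readout))
    return "approve" if score > 0 else "deny"


if __name__ == "__main__":
    acts_a = [[1.0, 0.0], [3.0, 0.0]]   # made-up activations for group A
    acts_b = [[0.0, 0.0], [0.0, 0.0]]   # made-up activations for group B
    direction = steering_direction(acts_a, acts_b)
    hidden, readout = [0.5, 0.0], [1.0, 0.0]
    print(toy_decision(hidden, readout))
    print(toy_decision(steer(hidden, direction, alpha=-2.0), readout))
```

In a real model the same addition would be applied to the residual stream at a chosen layer during the forward pass. The choice of layer and of `alpha` are exactly the kind of parameters a study like this would need to vary.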

Critically, this latent bias is asymmetric: steering interventions affect decisions in one demographic direction while producing minimal effects in the reverse direction. It is also susceptible to adversarial prompt engineering and parameter-efficient fine-tuning.
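
One simple way to quantify such an asymmetry is to compare decision flip rates under steering in each direction. The metric below is an illustrative assumption, not the measure used in the study, and the decision lists are made-up values.

```python
from typing import Sequence


def flip_rate(before: Sequence[str], after: Sequence[str]) -> float:
    """Fraction of cases whose decision changed after an intervention."""
    if not before or len(before) != len(after):
        raise ValueError("need two non-empty decision lists of equal length")
    flips = sum(1 for b, a in zip(before, after) if b != a)
    return flips / len(before)


def asymmetry(rate_one_way: float, rate_reverse: float) -> float:
    """Difference in flip rates; zero means steering is equally potent both ways."""
    return rate_one_way - rate_reverse


if __name__ == "__main__":
    # Hypothetical decisions, for illustration only.
    baseline_approved = ["approve"] * 4
    after_steer_one_way = ["deny", "deny", "deny", "approve"]
    baseline_denied = ["deny"] * 4
    after_steer_reverse = ["deny", "deny", "deny", "deny"]

    one_way = flip_rate(baseline_approved, after_steer_one_way)
    reverse = flip_rate(baseline_denied, after_steer_reverse)
    print(one_way, reverse, asymmetry(one_way, reverse))
```

A large positive or negative value would indicate that steering moves decisions far more readily in one demographic direction than in the other.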

These findings demonstrate that behavioural audits focused on outputs are insufficient: fair outputs can mask exploitable internal biases. They also motivate dual-layer testing frameworks combining output evaluation with representational analysis for AI governance in high-stakes decisions.
