Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

Abstract: Large language models (LLMs) are increasingly used as automatic judges for summarization and dialogue evaluation. Prior work has documented position, verbosity, and style biases, but it has focused largely on judge outcomes, leaving judge explanations underexplored.

We instead ask whether LLM judges are cue-invariant, i.e., whether their rankings and explanations remain stable when non-evidential cues are perturbed while holding the underlying texts fixed. We introduce a suite of cue interventions (Blind, Truth, Flip, Placebo, Reveal-After) and tie-aware metrics that quantify outcome anchoring and rationale anchoring, including label-aligned rhetoric and explanation drift, alongside consistency and stereotype-intrusion checks.
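To make the cue-invariance setup concrete, the following is a minimal sketch of how label cues might be varied over a fixed pair of summaries, and how a tie-aware shift score could be computed against the Blind condition. The function names, the placebo labels, the verdict vocabulary ("A", "B", "tie"), and the half-credit treatment of ties are all assumptions made for illustration; the abstract does not define the metrics at this level of detail. Reveal-After is left out because it depends on a multi-turn protocol that the abstract does not describe.

```python
from typing import Dict, Optional, Sequence, Tuple

Labels = Tuple[Optional[str], Optional[str]]


def build_conditions(
    true_labels: Tuple[str, str],
    placebo_labels: Tuple[str, str] = ("System A", "System B"),  # hypothetical
) -> Dict[str, Labels]:
    """Return the source labels shown to the judge under each condition.

    The two summary texts stay fixed; only the labels attached to them change.
    """
    first, second = true_labels
    return {
        "blind": (None, None),      # no source labels shown
        "truth": (first, second),   # correct source labels
        "flip": (second, first),    # labels swapped, texts unchanged
        "placebo": placebo_labels,  # uninformative labels
    }


def verdict_shift(blind: str, perturbed: str) -> float:
    """Score one verdict change: 0 if unchanged, 0.5 if a tie is involved,
    1 for a full reversal (an assumed weighting, not the paper's)."""
    if blind == perturbed:
        return 0.0
    if "tie" in (blind, perturbed):
        return 0.5
    return 1.0


def anchoring_rate(blind: Sequence[str], perturbed: Sequence[str]) -> float:
    """Mean tie-aware shift between Blind verdicts and perturbed verdicts."""
    if len(blind) != len(perturbed) or not blind:
        raise ValueError("verdict lists must be non-empty and equally long")
    return sum(verdict_shift(b, p) for b, p in zip(blind, perturbed)) / len(blind)
```

Under this sketch, a cue-invariant judge would yield an anchoring rate of 0 for every perturbed condition, since the underlying texts never change.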

We design anchoring attacks using verbosity and confidence cues, and compare two mitigations: structured chain-of-thought prompting and PROOF-BEFORE-PREFERENCE (evidence lock, score, rank). Using a new dataset of 1,000 summaries from traditional extractive models and LLMs, we find substantial cue-anchored rationalization under label and placebo perturbations, while PROOF-BEFORE-PREFERENCE markedly improves cue invariance over baselines.
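The three stages named for PROOF-BEFORE-PREFERENCE (evidence lock, score, rank) can be read as a pipeline in which the preference is derived only from evidence fixed beforehand. The sketch below is one possible reading, not the authors' implementation: `extract_evidence`, `score_evidence`, and `tie_margin` are hypothetical placeholders for whatever judge calls and tie rule are actually used.

```python
from typing import Callable, Dict, List, Tuple


def proof_before_preference(
    summaries: Dict[str, str],
    extract_evidence: Callable[[str], List[str]],  # hypothetical judge call
    score_evidence: Callable[[List[str]], float],  # hypothetical judge call
    tie_margin: float = 0.0,                       # assumed tie rule
) -> Tuple[Dict[str, Tuple[str, ...]], Dict[str, float], List[List[str]]]:
    """Run evidence lock, score, and rank in that order."""
    # Stage 1 (evidence lock): gather evidence from the text alone and
    # freeze it as tuples so later stages cannot revise it.
    locked = {name: tuple(extract_evidence(text)) for name, text in summaries.items()}

    # Stage 2 (score): the scorer sees only the locked evidence,
    # never source labels or other cues.
    scores = {name: float(score_evidence(list(ev))) for name, ev in locked.items()}

    # Stage 3 (rank): sort by score, then group candidates whose score is
    # within tie_margin of the tier leader into the same tier.
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    tiers: List[List[str]] = []
    for name in ordered:
        if tiers and abs(scores[tiers[-1][0]] - scores[name]) <= tie_margin:
            tiers[-1].append(name)
        else:
            tiers.append([name])
    return locked, scores, tiers
```

Because labels never enter any stage in this reading, swapping or replacing them cannot change the ranking, which is the property the cue interventions are meant to test.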
