LLM Self-Preference Bias: How Anonymized Peer Review Fixes It

The panel had been agreeing with itself for a week before I noticed, and the worst part was that the logs looked healthy the whole time. I had built what felt like a clean idea: several frontier models from different families, each one judging a pool of candidate outputs and ranking them best to worst. A jury of machines. I would generate a handful of answers, let the panel vote, take the winner, and trust that five independent opinions beat one. That was the whole pitch I had sold myself at 1 a.m., and for a few days it ran without complaint.

The rankings came in. A winner emerged every round. The dashboard was green. Then I started actually reading what won. The outputs the panel kept crowning were not the sharpest. They were the ones that sounded a particular way. Numbered lists where the content did not need numbering. A certain rhythm to the sentences. A house style. I stared at it for a while before the shape of it landed, and when it did it was a little sickening: my panel was not selecting for quality. It was selecting for resemblance. The judges were rewarding the candidates that wrote the way the judges write. I had built a popularity contest and dressed it up as an evaluation.

The thing nobody tells you you assumed

The premise underneath every multi-model panel is that the judges are neutral. You assume a model reading an unlabeled answer scores it on merit. It does not. Panickssery and colleagues measured this directly in 2024, in a NeurIPS paper with the unambiguous title “LLM Evaluators Recognize and Favor Their Own Generations.” They found GPT-4 preferred its own output at a pairwise win rate above 0.90 on summarization tasks. In over ninety percent of head-to-head comparisons, the model picked the answer it had written. Not because it was better. Because it was its own.
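To make "pairwise win rate" concrete, here is a minimal sketch of how I would compute it from a panel's own logs. It is not the paper's code. The `Comparison` record, the `self_preference_rate` function, and the model names are all hypothetical, and the sketch assumes each log entry records the judge, the two authors, and the author of the answer the judge picked.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Comparison:
    """One head-to-head judgment (hypothetical log format)."""
    judge: str     # model that did the judging
    author_a: str  # model that wrote the first answer
    author_b: str  # model that wrote the second answer
    winner: str    # author of the answer the judge picked


def self_preference_rate(comparisons: List[Comparison], judge: str) -> Optional[float]:
    """Share of comparisons in which `judge` picked its own answer.

    Only comparisons where the judge wrote exactly one of the two
    answers are counted. Returns None if there are no such comparisons.
    """
    relevant = [
        c for c in comparisons
        if c.judge == judge
        and judge in (c.author_a, c.author_b)
        and c.author_a != c.author_b
    ]
    if not relevant:
        return None
    own_wins = sum(1 for c in relevant if c.winner == judge)
    return own_wins / len(relevant)


# Made-up records, only to show the shape of the calculation.
comparisons = [
    Comparison("model_x", "model_x", "model_y", "model_x"),
    Comparison("model_x", "model_y", "model_x", "model_x"),
    Comparison("model_x", "model_x", "model_z", "model_z"),
    Comparison("model_x", "model_y", "model_z", "model_y"),  # skipped: judge wrote neither
    Comparison("model_y", "model_x", "model_y", "model_y"),
]

rate_x = self_preference_rate(comparisons, "model_x")
print(rate_x)
```

A high number on its own does not prove bias, since the judge's answers might simply be better. It only becomes evidence when compared against a baseline such as human ratings of the same pairs.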

The effect runs along family lines. Prose in one model family’s house style reads better to a judge from that same family. A more hedged, more structured answer reads better to a judge that writes that way. So when I assembled a panel and let it vote on a pool that included its own members’ outputs, what I actually measured was which style happened to be most common among my evaluators. The highest-scoring answer was the one whose fingerprint matched the room. I had spent the 1 a.m. planning session congratulating myself on independence, and built the opposite.

And it was not only the obvious bias. Once I went looking, there were three of them stacked on top of each other. Self-preference was the loud one. Underneath it sat verbosity bias, where models score longer answers higher because length reads as effort and authority, even when the extra words say nothing. So my selection criterion was quietly drifting toward “writes the most” rather than “answers best.” And under that sat position bias, where the first answer in an ordered list anchors the judgment, the same anchoring documented in human juries. Whichever candidate happened to appear first carried a structural head start that had nothing to do with being right. Three biases, one panel, all of them invisible in a green dashboard.
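The two quieter biases can at least be checked for in the logs. The sketch below is one rough way to do that, not something from the original setup. It assumes a hypothetical `Round` record holding the word count of each candidate in the order shown to the judge, plus the position of the winner, and compares two observed rates against what uniform random picking would give.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Round:
    """One judging round (hypothetical log format)."""
    lengths: List[int]  # word count of each candidate, in the order shown to the judge
    winner_index: int   # position of the candidate the judge ranked first


def first_position_win_rate(rounds: List[Round]) -> float:
    """How often the candidate shown first won (position bias signal)."""
    return sum(r.winner_index == 0 for r in rounds) / len(rounds)


def longest_win_rate(rounds: List[Round]) -> float:
    """How often the longest candidate won (verbosity bias signal)."""
    return sum(r.lengths[r.winner_index] == max(r.lengths) for r in rounds) / len(rounds)


def chance_rate(rounds: List[Round]) -> float:
    """Rate expected for any single slot if winners were picked uniformly at random."""
    return sum(1 / len(r.lengths) for r in rounds) / len(rounds)


# Made-up rounds with three candidates each.
rounds = [
    Round([120, 80, 95], 0),  # first and longest won
    Round([60, 150, 70], 1),  # longest won, not first
    Round([90, 40, 50], 0),   # first and longest won
    Round([30, 45, 100], 1),  # neither first nor longest won
]

first_rate = first_position_win_rate(rounds)
longest_rate = longest_win_rate(rounds)
baseline = chance_rate(rounds)
print(first_rate, longest_rate, baseline)
```

Rates well above the baseline over many rounds are a reason to look closer, not a verdict: longer answers are sometimes genuinely better, and four rounds like the ones above prove nothing.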

The wrong fix I reached for first

My first instinct was to out-engineer it. Add a rubric. Tell every judge, in the prompt, to ignore style and length and score only on correctness. Lecture the jury about fairness before it deliberates. It did almost nothing, and in hindsight it could not have. You cannot instruct a model out of a preference it does not know it has. The recognition is happening below the level the prompt can reach. The judge is not consciously thinking “this is mine, I shall reward it.” It is reading prose that matches its own training distribution and finding it more fluent, more correct-feeling, more right. Asking it to be fair is asking it to notice a bias it cannot see. I was trying to argue a model out of its own reflection.

The real problem was not that the judges were biased. It was that the judges could tell whose work they were reading. The bias needed information to operate, and I was handing that information over for free.

The turn

The fix was not mine, and I want to be clear about that, because the elegant part was already sitting in public when I got there. Andrej Karpathy had published a small project called llm-council that tackles exactly this, and the mechanism is almost insultingly simple: do not let the judges know whose output they are reading. That is the entire idea. Before the panel votes, you strip every identity off the candidates. The first answer becomes “response A,” the second “response B,” and so on. No model name. No provider. No tell.

The server keeps a private mapping of which label belongs to which model, a clean one-to-one assignment in both directions, so that after the votes are in you can reverse it and reconstruct exactly who scored what. The judges see only neutral labels and the text. The information the bias needs to operate is simply absent during the vote. It works because you cannot favor what you cannot identify. Self-preference dies the moment the judge does not know which answer is its own.
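The mechanism is small enough to sketch. What follows is my own minimal version, not the code from llm-council. The function names `anonymize` and `deanonymize`, the model names, and the optional shuffle are all assumptions for illustration. The shuffle in particular is not part of the description above; it is there only so that a fixed model order does not always map to the same letter.

```python
import random
import string
from typing import Dict, Optional, Tuple


def anonymize(
    outputs: Dict[str, str],
    rng: Optional[random.Random] = None,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Replace model names with neutral labels.

    outputs maps model name -> answer text.
    Returns (labeled, label_to_model): labeled is what the judges see,
    label_to_model is the private mapping kept on the server.
    """
    if len(outputs) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 candidates")
    models = list(outputs)
    if rng is not None:
        rng.shuffle(models)  # optional: vary which model gets which letter
    label_to_model = {
        f"response {string.ascii_uppercase[i]}": model
        for i, model in enumerate(models)
    }
    labeled = {label: outputs[model] for label, model in label_to_model.items()}
    return labeled, label_to_model


def deanonymize(
    scores_by_label: Dict[str, float],
    label_to_model: Dict[str, str],
) -> Dict[str, float]:
    """Map the judges' per-label scores back to model names after the vote."""
    return {label_to_model[label]: score for label, score in scores_by_label.items()}


# Hypothetical candidates; the judges only ever receive `labeled`.
outputs = {
    "model_x": "Answer one.",
    "model_y": "Answer two.",
    "model_z": "Answer three.",
}
labeled, label_to_model = anonymize(outputs)

# Pretend the panel returned these scores against the neutral labels.
scores_by_label = {"response A": 2, "response B": 3, "response C": 1}
scores_by_model = deanonymize(scores_by_label, label_to_model)
print(scores_by_model)
```

Because each model gets exactly one label and each label exactly one model, the mapping can be inverted without ambiguity, which is what makes the after-the-fact reconstruction possible.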

Hiding the names also strips the most obvious recognition signal, which dents style bias too, though not all the way, because if a model writes in an unmistakable rhythm its identity is still legible in the prose itself. Anonymization breaks the label, not the fingerprint. But the label was doing most of the damage, and removing it changed the room. The first time I rewired my panel to run blind and watched the rankings come back, the winners were different. The house-style answers stopped sweeping. The thing that had been quietly rigging my evaluation for a week was just gone, because I had taken away the one piece of information it ran on. That is a strange and specific kind of satisfaction, watching a bias evaporate not because you argued with it but because you starved it.