Prompt Framing Distorts Count-Based Evaluation of LLM Error Detection: Evidence from Numeric Anchoring

Prompt Framing Distorts Count-Based Evaluation of LLM Error Detection: Evidence from Numeric Anchoring

提示词框架扭曲了基于计数的LLM错误检测评估:来自数字锚定效应的证据

Abstract: Count-based F1 is widely used as a proxy for LLM error-detection quality, but this paper shows that it can rise dramatically without a corresponding improvement in span localization, a gap termed F1 Inflation. 摘要: 基于计数的 F1 分数常被用作衡量大语言模型(LLM)错误检测质量的指标,但本文研究表明,该指标可能会在跨度定位(span localization)没有相应提升的情况下大幅上升,这种现象被称为“F1 通胀”(F1 Inflation)。

The paper introduces ErrorBench, a controlled stress-test protocol for prompt-induced count distortion. ErrorBench evaluates six contemporary LLMs under five prompt conditions over 4,290 responses from 143 CoNLL-2014 passages. 本文引入了 ErrorBench,这是一个用于测试提示词诱导计数失真的受控压力测试协议。ErrorBench 在五种提示词条件下,对六种当代大语言模型进行了评估,共分析了来自 143 篇 CoNLL-2014 文章的 4,290 条响应。

Under CoNLL-2014 M2-style scoring, anchored prompts produce up to 0.79 points of F1 Inflation, and up to 0.96 under strict matching. A 100-passage replication using the official ERRANT 3.0.0 pipeline and multi-reference scoring reproduces the pattern: averaged over six models, the Blind-to-Anchored prompt shift raises Count-F1 by +0.21 while raising multi-reference ERRANT F0.5 by only +0.04. 在 CoNLL-2014 M2 风格的评分标准下,锚定提示词(anchored prompts)会导致高达 0.79 点的 F1 通胀,而在严格匹配下则高达 0.96 点。使用官方 ERRANT 3.0.0 流水线和多参考评分进行的 100 篇文章复现实验重现了这一模式:在六个模型的平均值中,从“盲测”到“锚定”提示词的转变使 Count-F1 提高了 +0.21,而多参考 ERRANT F0.5 仅提高了 +0.04。

The study finds larger count responses in highly instruction-compliant GPT/Claude systems and smaller responses in the Gemini family under this stress-test protocol. The findings suggest that LLM proofreading and document-review evaluations should avoid pre-populated error counts and should report span-aware metrics alongside count-based metrics. 研究发现,在该压力测试协议下,指令遵循度极高的 GPT/Claude 系统倾向于给出更大的计数响应,而 Gemini 系列模型则倾向于给出较小的响应。研究结果表明,在进行大语言模型校对和文档审查评估时,应避免预设错误计数,并应在报告基于计数的指标的同时,提供具备跨度感知(span-aware)能力的评估指标。