Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Abstract: Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves.

Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models.
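
The controlled-intervention idea can be illustrated with a minimal sketch: take a clean trace, corrupt exactly one step so the injected failure and its location are known, then score a judge against those known labels. This is an illustration under stated assumptions, not the REFLECT implementation. The names `Step`, `Instance`, `intervene`, `judge_accuracy`, and `lenient_judge`, as well as the failure label `unsupported_claim`, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    kind: str      # hypothetical step types, e.g. "tool_call", "tool_output", "report"
    content: str


@dataclass(frozen=True)
class Instance:
    steps: Tuple[Step, ...]
    failure_mode: Optional[str]   # None marks an unmodified (clean) trace
    failure_index: Optional[int]  # position of the injected failure, if any


def intervene(
    trace: Sequence[Step],
    index: int,
    failure_mode: str,
    corrupt: Callable[[str], str],
) -> Instance:
    """Corrupt a single step and record which failure was injected where."""
    steps = list(trace)
    steps[index] = replace(steps[index], content=corrupt(steps[index].content))
    return Instance(tuple(steps), failure_mode, index)


# A judge maps a trace to a predicted failure label (None = "no failure").
Judge = Callable[[Tuple[Step, ...]], Optional[str]]


def judge_accuracy(judge: Judge, instances: Sequence[Instance]) -> float:
    """Fraction of instances where the judge's label matches the known label."""
    hits = sum(judge(inst.steps) == inst.failure_mode for inst in instances)
    return hits / len(instances)


clean_trace = (
    Step("tool_call", "search('boiling point of water at sea level')"),
    Step("tool_output", "Water boils at 100 C at sea level."),
    Step("report", "Water boils at 100 C at sea level [1]."),
)
clean = Instance(clean_trace, None, None)

# Localized intervention: the report now contradicts the retrieved evidence.
corrupted = intervene(
    clean_trace, 2, "unsupported_claim", lambda s: s.replace("100 C", "90 C")
)


def lenient_judge(steps: Tuple[Step, ...]) -> Optional[str]:
    """Toy stand-in for an LLM judge that never flags a failure."""
    return None


accuracy = judge_accuracy(lenient_judge, [clean, corrupted])
```

Because each instance carries its injected label, a judge's verdict can be checked directly rather than compared against subjective human preference. In this toy setup the lenient judge is right on the clean trace and wrong on the corrupted one.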

Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.
