LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges

Abstract: The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits, motivating the exploration of large language models (LLMs) as intelligent automated evaluation assistants. Although recent studies show that LLMs can generate fluent critiques and approximate reviewer scores, their reliability, robustness, and security as decision-support systems remain insufficiently understood.

This survey offers a systems-level analysis of LLM-based scientific peer review, focusing on two core evaluative functions: critique generation and score prediction. We present a structured taxonomy of modeling approaches (prompt-based, supervised, retrieval-augmented, and alignment-optimized) and synthesize empirical findings across existing benchmarks.

We analyze dataset constraints, evaluation shortcomings, and domain concentration biases that limit current assessment practices. Beyond performance metrics, we identify emerging robustness risks, including prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking, which expose automated review pipelines to strategic manipulation.

From a data mining perspective, we outline key open challenges in modeling subjective reviewer disagreement and achieving cross-domain generalization. By reframing automated peer review as a high-stakes, multi-objective decision problem, this survey provides a roadmap for developing robust, transparent, and trustworthy AI-assisted scientific evaluation systems.
