Review Arcade: On the Human Alignment and Gameability of LLM Reviews
Abstract: LLM-generated reviews of scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume not only that reviewers are using LLM assistance, but also that authors are using LLMs to revise their papers before submission.
In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author's and the reviewer's perspective. First, we find that LLM reviews align with human reviews only to a limited degree. In the best-case scenario the alignment is reasonable, but it varies substantially across prompts and models.
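One common way to quantify how closely LLM review scores track human ones is a rank correlation over the same set of papers. The abstract does not state which alignment measure is used, so the following is only a minimal sketch under that assumption: it computes Spearman's rank correlation in plain Python, and the function names `average_ranks` and `spearman` as well as the example scores are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with order[i].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(human_scores, llm_scores):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    if len(human_scores) != len(llm_scores) or len(human_scores) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx, ry = average_ranks(human_scores), average_ranks(llm_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical overall scores for five papers (illustrative, not from the paper).
human = [2.0, 3.5, 3.0, 4.0, 2.5]
llm = [3.0, 3.5, 3.5, 4.0, 3.0]
rho = spearman(human, llm)
```

Running the same computation once per prompt and per model would make the variation in alignment across prompts and models directly comparable.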
Finally, we investigate the scenario in which the author uses an iterative draft-revise workflow to improve the submission according to the LLM review. We find that this “gaming” of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase in overall scores for up to 35% of papers. We publish our code.
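To make "statistically significant increase" concrete for a single paper, one option is to compare repeated LLM review scores before and after revision. The abstract does not name the test that was used, so the sketch below swaps in a one-sided exact permutation test as an assumption; the function name `permutation_p_value` and the scores are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(before, after):
    """One-sided exact permutation test for mean(after) > mean(before).

    Enumerates every way of splitting the pooled scores into groups of the
    original sizes and returns the fraction of splits whose mean difference
    is at least as large as the observed one.
    """
    if not before or not after:
        raise ValueError("both score lists must be non-empty")
    pooled = list(before) + list(after)
    n_after = len(after)
    n_before = len(pooled) - n_after
    observed = mean(after) - mean(before)
    total = sum(pooled)
    hits = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_after):
        after_sum = sum(pooled[i] for i in idx)
        diff = after_sum / n_after - (total - after_sum) / n_before
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            hits += 1
        count += 1
    return hits / count


# Hypothetical repeated overall scores for one paper (illustrative only).
scores_before = [1, 2, 3]
scores_after = [4, 5, 6]
p = permutation_p_value(scores_before, scores_after)
improved = p <= 0.05
```

Exact enumeration is only practical for a handful of repeated reviews per paper; with larger samples, a randomly sampled subset of splits would be the usual substitute.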