Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Abstract: LLM-generated reviews of scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume not only that reviewers use LLM assistance, but also that authors use LLMs to revise their papers before submitting.

In this work, we run experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author's and the reviewer's perspective. First, we find that LLM reviews align only to a limited degree with human ones. In the best-case scenario, the alignment is reasonable; however, it varies substantially across prompts and models.

Finally, we investigate the scenario in which the author uses an iterative draft-revise workflow to improve the submission according to the LLM review. We find that this “gaming” of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase in overall scores for up to 35% of papers. We publish our code.
