Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations

Abstract: Improving the Theory of Mind (ToM) capability of Large Language Models (LLMs) is crucial for effective social interactions between these AI models and humans. However, existing benchmarks often measure ToM improvement through story reading and multiple-choice questions posed from a third-person perspective, ignoring the first-person, dynamic, and open-ended nature of human-AI (HAI) interactions.

To directly examine how ToM improvement techniques benefit HAI interactions, we first proposed a new paradigm of interactive ToM evaluation that shifts both the perspective and the metrics. Following this paradigm, we then conducted a systematic study of four representative ToM enhancement techniques using four real-world datasets and a user study, covering both goal-oriented tasks (e.g., coding, math) and experience-oriented tasks (e.g., counseling).

Our findings reveal that improvements on static benchmarks do not always translate to better performance in dynamic HAI interactions. This paper offers critical insights into ToM evaluation, showing that interaction-based assessments are necessary for developing next-generation, socially aware LLMs for HAI symbiosis.