Can LLM Teams Play What? Where? When?
Abstract: Large language models (LLMs) remain limited on tasks requiring indirect reasoning, cultural knowledge, and coordinated hypothesis testing. We investigate whether team-based interaction improves LLM performance in What? Where? When? (ChGK), a quiz game designed to reward collective reasoning.
We introduce three team strategies: Voting, Silent Team (the captain observes final answers), and Talkative Team (the captain observes both answers and rationales). To minimize data leakage, we evaluate these strategies on a dataset consisting of 572 ChGK questions released in 2025.
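The three strategies differ mainly in what the captain is allowed to see. The following is a minimal sketch of that distinction, not the authors' implementation: the `Response` type, the `normalize` helper, the prompt wording, and the `captain` callable (standing in for a call to the captain model) are all hypothetical names introduced for illustration, and first-seen tie-breaking in `voting` is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Response:
    """One team member's output for a question (hypothetical structure)."""
    model: str
    answer: str
    rationale: str


def normalize(answer: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace before comparing.
    return " ".join(answer.lower().split())


def voting(responses: List[Response]) -> str:
    # Majority vote over normalized answers; ties go to the first-seen answer.
    counts = Counter(normalize(r.answer) for r in responses)
    return counts.most_common(1)[0][0]


def captain_prompt(question: str, responses: List[Response],
                   show_rationales: bool) -> str:
    # Build the text shown to the captain; rationales appear only if requested.
    lines = [f"Question: {question}", "Teammate answers:"]
    for r in responses:
        line = f"- {r.model}: {r.answer}"
        if show_rationales:
            line += f" (rationale: {r.rationale})"
        lines.append(line)
    lines.append("Give the team's final answer.")
    return "\n".join(lines)


def silent_team(question: str, responses: List[Response],
                captain: Callable[[str], str]) -> str:
    # Silent Team: the captain sees final answers only.
    return captain(captain_prompt(question, responses, show_rationales=False))


def talkative_team(question: str, responses: List[Response],
                   captain: Callable[[str], str]) -> str:
    # Talkative Team: the captain sees answers together with rationales.
    return captain(captain_prompt(question, responses, show_rationales=True))
```

In this sketch the Silent and Talkative variants share one prompt builder and differ only in the `show_rationales` flag, which makes the comparison between them a single controlled change.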
Using six recent large-scale open models, we show that team-based strategies outperform single-model baselines, yielding gains of up to 20 percentage points in accuracy. The best team achieves 44.23% accuracy and approaches human team performance on the questions for which human statistics are available.
Analysis of inter-model diversity reveals that disagreement strongly predicts lower accuracy, but explanatory communication substantially mitigates performance drops. We further examine captain behavior and find no evidence of self-preference bias; access to peer rationales improves captain judgments.
Overall, LLM teams function primarily as answer selection and error-filtering mechanisms rather than generators of novel solutions. Our findings highlight the importance of interaction and suggest adaptive strategies as a promising direction for multi-agent systems.