GPTNT: Benchmarking Real-Time Collaboration Between Multimodal Agents on Keep Talking And Nobody Explodes
Multimodal models are increasingly deployed to solve tasks collaboratively with humans or other artificial agents. Existing benchmarks show that these models possess many of the required component capabilities, but the conditions that coincide in collaboration, including time pressure, information asymmetry, and imperfect communication, are usually studied in isolation.
We introduce GPTNT, a benchmark built on the cooperative video game Keep Talking and Nobody Explodes, in which two agents must coordinate to defuse procedurally generated bomb puzzles against a live countdown. One agent can see and manipulate the bomb but does not have the defusal instructions; the other has the instructions but cannot see or manipulate the bomb. Neither agent can succeed alone: success requires effective and efficient communication.
Unlike turn-based proxies, GPTNT requires agents to act asynchronously and communicate in real time. GPTNT is designed to separate collaboration from reliance on memorized solutions: the instruction manual, the partner, or both can be withheld to isolate what a model derives in the moment from what it already knows.
We show that GPTNT poses a substantial challenge for state-of-the-art systems: none of the closed- or open-source models we test defuses a single bomb in real time, a bar that human players clear. Through controlled experiments, we identify critical weaknesses in state tracking, efficient action under time pressure, ambiguity handling, and error recovery.
We release GPTNT as a benchmark for collaborative performance that current evaluations leave unmeasured. Because it runs on the real game, GPTNT benefits from procedural generation and inherits a living modding community, allowing the benchmark to evolve as models improve rather than being solved once and retired.