CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing
Recent advances in large language models have led to strong performance on reasoning and environment-interaction tasks, yet their capacity for creative problem solving remains underexplored. We study this capability through the lens of creative tool use, where a model repurposes available objects by reasoning about their affordances and attributes rather than relying on canonical usage.
As a first step, we introduce CreativityBench, a benchmark for evaluating affordance-based creativity in LLMs. To this end, we build a large-scale affordance knowledge base (KB) with 4K entities and 150K+ affordance annotations, explicitly linking objects, parts, attributes, and actionable uses. Building on this KB, we generate 14K grounded tasks that require identifying non-obvious yet physically plausible solutions under constraints.
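To make the KB structure concrete, a single entry and a generated task might take roughly the following form. This is a minimal illustrative sketch in Python; the field names, objects, and example task are our own assumptions, not the benchmark's actual schema.

# Hypothetical sketch of one KB entry and one generated task.
# All field names and values are illustrative assumptions, not the released schema.
kb_entry = {
    "object": "wire hanger",
    "parts": ["hook", "bar"],
    "attributes": {"hook": ["thin", "rigid", "curved"], "bar": ["long", "bendable"]},
    "affordances": {
        "hook": ["reach and pull small items"],
        "bar": ["extend reach when straightened"],
    },
}

task = {
    "goal": "retrieve a ring that rolled under a heavy couch",
    "available_objects": ["wire hanger", "mug", "towel"],
    "constraints": ["the couch cannot be moved"],
    # A full solution names the object, the relevant part, its affordance,
    # and the physical mechanism, rather than only the object itself.
    "reference_solution": {
        "object": "wire hanger",
        "part": "hook",
        "affordance": "reach and pull small items",
        "mechanism": "slide the hook under the couch and drag the ring out",
    },
}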
Evaluations across 10 state-of-the-art LLMs, spanning both closed- and open-source models, show that models often select a plausible object but fail to identify the correct parts, their affordances, and the underlying physical mechanism needed to solve the task, leading to a substantial drop in performance.
Furthermore, gains from model scaling saturate quickly, strong general reasoning does not reliably translate into creative affordance discovery, and common inference-time strategies such as Chain-of-Thought prompting yield limited improvements. These results suggest that creative tool use remains a major challenge for current models, and that CreativityBench provides a useful testbed for studying this missing dimension of intelligence, with potential implications for the planning and reasoning modules of future agents.