MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

Abstract: The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices remains a difficult, understudied challenge that combines spatiotemporal constraint modeling with speech input, dynamic state tracking, and mixed-initiative interaction patterns.

We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic, multi-turn, voice-driven code generation task that operates over IoT devices. We find a significant gap between open- and closed-weight multimodal LLMs on MIST, and even frontier closed-weight LLMs leave substantial headroom.

We release MIST, along with an extensible data generation framework for building related datasets, to facilitate research on mixed-initiative voice assistants that reason about physical-world constraints.


Paper Details:

  • Authors: Maximillian Chen, Xuanming Zhang, Michael Peng, Zhou Yu, Alexandros Papangelis, Yohan Jo
  • Submission Date: 7 May 2026
  • Primary Category: Computation and Language (cs.CL)
