IntentGrasp: A Comprehensive Benchmark for Intent Understanding
Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source dataset curation, intent label contextualization, and task format unification.
IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations of 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) reveal unsatisfactory performance, with scores below 60% on the All Set and below 25% on the Gem Set. Notably, 17 of the 20 tested models perform worse than a random-guess baseline (15.2%) on the Gem Set, while estimated human performance is ~81.1%, showing substantial room for improvement.
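The random-guess baseline referenced above can be estimated from a label distribution. A minimal sketch, assuming a classifier that samples its predictions from the empirical label frequencies (the toy labels below are illustrative, not drawn from IntentGrasp):

```python
from collections import Counter

def random_guess_accuracy(labels):
    """Expected accuracy of a guesser that samples predictions
    from the empirical label distribution of the test set."""
    counts = Counter(labels)
    total = len(labels)
    # P(correct) = sum over classes c of P(true = c) * P(guess = c)
    return sum((n / total) ** 2 for n in counts.values())

# Illustrative toy intent labels (hypothetical, not IntentGrasp data)
labels = ["book_flight"] * 5 + ["cancel_order"] * 3 + ["greet"] * 2
print(random_guess_accuracy(labels))  # ≈ 0.38 for this toy distribution
```

For a perfectly balanced set of k classes this reduces to 1/k, which is why a more balanced evaluation set tends to have a lower chance baseline.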
To enhance this ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes models on the IntentGrasp training set, yielding significant gains of 30+ F1 points on the All Set and 20+ points on the Gem Set. Moreover, leave-one-domain-out (LODO) experiments demonstrate the strong cross-domain generalizability of IFT, indicating that it is a promising approach for substantially enhancing the intent understanding of LLMs.
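A leave-one-domain-out evaluation of this kind can be sketched as follows; the field names and domain values are hypothetical placeholders, not the actual IntentGrasp schema:

```python
def lodo_splits(examples):
    """For each domain, yield (held_out_domain, train, test):
    train on all other domains, test on the held-out one."""
    domains = sorted({ex["domain"] for ex in examples})
    for held_out in domains:
        train = [ex for ex in examples if ex["domain"] != held_out]
        test = [ex for ex in examples if ex["domain"] == held_out]
        yield held_out, train, test

# Toy corpus with hypothetical domains and intent labels
data = [
    {"text": "book me a flight", "intent": "book_flight", "domain": "travel"},
    {"text": "cancel my order", "intent": "cancel_order", "domain": "shopping"},
    {"text": "play some jazz", "intent": "play_music", "domain": "media"},
]
for held_out, train, test in lodo_splits(data):
    print(held_out, len(train), len(test))
```

Cross-domain generalization is then measured by scoring the fine-tuned model on each held-out domain it never saw during training.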
Overall, by benchmarking and boosting intent understanding, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefit and social good.