IntentGrasp: A Comprehensive Benchmark for Intent Understanding
Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source dataset curation, intent label contextualization, and task format unification.
IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations of 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) reveal unsatisfactory performance, with scores below 60% on the All Set and below 25% on the Gem Set. Notably, 17 of the 20 tested models perform worse than a random-guess baseline (15.2%) on the Gem Set, while estimated human performance is ~81.1%, showing substantial room for improvement.
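The random-guess baseline referenced above can be estimated from a label distribution. A minimal sketch, assuming a classifier that samples its predictions from the empirical label frequencies (the toy labels below are illustrative, not drawn from IntentGrasp):

```python
from collections import Counter

def random_guess_accuracy(labels):
    """Expected accuracy of a guesser that samples predictions
    from the empirical label distribution of the test set."""
    counts = Counter(labels)
    total = len(labels)
    # P(correct) = sum over classes c of P(true = c) * P(guess = c)
    return sum((n / total) ** 2 for n in counts.values())

# Illustrative toy intent labels (hypothetical, not IntentGrasp data)
labels = ["book_flight"] * 5 + ["cancel_order"] * 3 + ["greet"] * 2
print(random_guess_accuracy(labels))  # ≈ 0.38 for this toy distribution
```

For a perfectly balanced set of k classes this reduces to 1/k, which is why a more balanced evaluation set tends to have a lower chance baseline.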
To enhance this ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes models on the IntentGrasp training set, yielding significant gains of 30+ F1 points on the All Set and 20+ points on the Gem Set. Moreover, leave-one-domain-out (LODO) experiments demonstrate the strong cross-domain generalizability of IFT, indicating that it is a promising approach for substantially enhancing the intent understanding of LLMs.
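A leave-one-domain-out evaluation of this kind can be sketched as follows; the field names and domain values are hypothetical placeholders, not the actual IntentGrasp schema:

```python
def lodo_splits(examples):
    """For each domain, yield (held_out_domain, train, test):
    train on all other domains, test on the held-out one."""
    domains = sorted({ex["domain"] for ex in examples})
    for held_out in domains:
        train = [ex for ex in examples if ex["domain"] != held_out]
        test = [ex for ex in examples if ex["domain"] == held_out]
        yield held_out, train, test

# Toy corpus with hypothetical domains and intent labels
data = [
    {"text": "book me a flight", "intent": "book_flight", "domain": "travel"},
    {"text": "cancel my order", "intent": "cancel_order", "domain": "shopping"},
    {"text": "play some jazz", "intent": "play_music", "domain": "media"},
]
for held_out, train, test in lodo_splits(data):
    print(held_out, len(train), len(test))
```

Cross-domain generalization is then measured by scoring the fine-tuned model on each held-out domain it never saw during training.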
Overall, by benchmarking and boosting intent understanding, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefit and social good.