Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

Abstract: Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications. Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities absent from previous e-commerce and general-purpose benchmarks.

We introduce Shopping Reasoning Bench, a benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10,863 importance-weighted binary rubric criteria authored by retail domain experts. The criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment.
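To make "importance-weighted binary rubrics" concrete, the sketch below shows one plausible way to score a single response. The abstract does not specify the aggregation formula, so the weight-normalized pass rate, the `Criterion` class, the weight values, and the example criteria are all illustrative assumptions rather than the benchmark's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One binary rubric criterion (hypothetical structure)."""
    description: str
    weight: float  # assumed importance weight; the real scale is not given
    passed: bool   # binary judgment: did the response satisfy the criterion?


def weighted_pass_rate(criteria: list[Criterion]) -> float:
    """Return the share of total importance weight carried by passed criteria."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("criteria must carry positive total weight")
    earned = sum(c.weight for c in criteria if c.passed)
    return earned / total


# Hypothetical rubric for one shopping mission.
example = [
    Criterion("Recommendation stays within the stated budget", 3.0, True),
    Criterion("Explains the trade-off between the two candidates", 2.0, False),
    Criterion("Checks compatibility with the customer's existing device", 1.0, True),
]

print(f"{weighted_pass_rate(example):.2f}")  # 0.67
```

Under this assumed scheme, failing a heavily weighted criterion costs more than failing a minor one, which is the behavior importance weighting is meant to capture.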

An evaluation of nine models across three families (GPT, Claude, Gemini) shows that overall pass rates reach only 57% to 77%. On multi-turn missions, all models score 13 to 29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades by 4 to 18 points as conversations progress. These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development.
