ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking


Abstract: Task-oriented dialogue systems, which handle transactions, reservations, and service requests, require predictable behavior, yet the moderately sized LLMs needed for practical latency are prone to hallucination and format errors that cascade into incorrect actions (e.g., a hotel booked for the wrong date). We propose ReacTOD, a bounded neuro-symbolic architecture that reformulates natural language understanding (NLU) as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation.


A bounded ReAct loop enables iterative self-correction, improving accuracy by up to 9.3 percentage points over single-pass inference on MultiWOZ. A symbolic validator enforces action compliance, schema conformance, and coreference consistency on every dialogue state update, achieving a 93.1% self-correction rate on intercepted errors and producing structured execution traces. Incremental state prediction and on-demand history retrieval keep prompts compact, empirically improving instruction adherence in parameter-constrained models.

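To make the bounded loop and the symbolic validator concrete, the following is a minimal sketch and not the authors' implementation. The slot schema, the action names (`update_slot`, `delete_slot`), the step budget of 3, and the `stub_propose` stand-in for the LLM are all assumptions made for illustration. The validator here covers only action compliance and schema conformance; coreference consistency is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical schema: slot name -> allowed values (None means free-form text).
SCHEMA = {
    "hotel-area": {"north", "south", "east", "west", "centre"},
    "hotel-stars": {"0", "1", "2", "3", "4", "5"},
    "hotel-name": None,
}
# Hypothetical set of tool actions the model is allowed to emit.
ALLOWED_ACTIONS = {"update_slot", "delete_slot"}


@dataclass
class ToolCall:
    action: str
    slot: str
    value: str = ""


def validate(call: ToolCall) -> Optional[str]:
    """Deterministic checks: return an error message, or None if the call is valid."""
    if call.action not in ALLOWED_ACTIONS:
        return f"unknown action '{call.action}'"
    if call.slot not in SCHEMA:
        return f"unknown slot '{call.slot}'"
    allowed = SCHEMA[call.slot]
    if call.action == "update_slot" and allowed is not None and call.value not in allowed:
        return f"value '{call.value}' not allowed for '{call.slot}'"
    return None


# A proposer maps (utterance, validator feedback so far) to one tool call.
Proposer = Callable[[str, List[str]], ToolCall]


def bounded_update(
    utterance: str,
    state: Dict[str, str],
    propose: Proposer,
    max_steps: int = 3,
) -> Tuple[Dict[str, str], List[dict]]:
    """Run at most max_steps propose/validate rounds; return (new_state, trace)."""
    feedback: List[str] = []
    trace: List[dict] = []
    for step in range(max_steps):
        call = propose(utterance, feedback)
        error = validate(call)
        trace.append({"step": step, "call": call, "error": error})
        if error is None:
            new_state = dict(state)
            if call.action == "update_slot":
                new_state[call.slot] = call.value
            else:
                new_state.pop(call.slot, None)
            return new_state, trace
        feedback.append(error)  # the error is fed back so the next attempt can correct it
    return dict(state), trace  # budget exhausted: keep the previous state unchanged


def stub_propose(utterance: str, feedback: List[str]) -> ToolCall:
    """Stand-in for an LLM: the first answer violates the schema, the retry fixes it."""
    if not feedback:
        return ToolCall("update_slot", "hotel-area", "downtown")
    return ToolCall("update_slot", "hotel-area", "centre")


state, trace = bounded_update("I need a hotel in the city centre", {}, stub_propose)
print(state)
print(len(trace))
```

In this sketch the first proposal is intercepted because `downtown` is not a legal value for `hotel-area`, the error message is returned to the proposer, and the second proposal passes validation. The `trace` list plays the role of a structured execution trace: one record per attempt, including the rejected one.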

On MultiWOZ 2.1, ReacTOD sets a new zero-shot state of the art: gpt-oss-20B reaches 52.71% joint goal accuracy (JGA), surpassing the previous best by 14 percentage points, while Qwen3-8B achieves 47.34% with only 8B parameters. On the Schema-Guided Dialogue (SGD) benchmark, ReacTOD with Claude-Opus-4.6 achieves 80.68% JGA under fully end-to-end evaluation with predicted domains, and Qwen3-32B reaches 64.09%, demonstrating cross-benchmark generalization without task-specific training data.

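For readers unfamiliar with the metric behind these figures, joint goal accuracy counts a turn as correct only when the entire predicted dialogue state matches the reference state. The sketch below shows that computation under simplifying assumptions: states are plain slot-to-value dictionaries, the example slots and values are invented, and the value normalization that benchmark evaluation scripts typically apply is left out.

```python
from typing import Dict, List

State = Dict[str, str]


def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    exact = sum(1 for p, g in zip(predicted, gold) if p == g)
    return exact / len(gold)


# Invented three-turn dialogue: the last turn has one wrong slot value.
gold = [
    {"hotel-area": "centre"},
    {"hotel-area": "centre", "hotel-stars": "4"},
    {"hotel-area": "centre", "hotel-stars": "4", "hotel-day": "friday"},
]
pred = [
    {"hotel-area": "centre"},
    {"hotel-area": "centre", "hotel-stars": "4"},
    {"hotel-area": "centre", "hotel-stars": "4", "hotel-day": "saturday"},
]

jga = joint_goal_accuracy(pred, gold)
print(f"{jga:.2%}")
```

Because a single wrong slot makes the whole turn count as incorrect, two of the three turns match and the third does not, even though eight of the nine individual slot values are right. This strictness is why per-turn validation of state updates can have a large effect on the reported number.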