ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

Abstract: Task-oriented dialogue systems, which handle transactions, reservations, and service requests, require predictable behavior. Yet the moderately sized LLMs needed for practical latency are prone to hallucination and format errors that cascade into incorrect actions (e.g., a hotel booked for the wrong date). We propose ReacTOD, a bounded neuro-symbolic architecture that reformulates NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation.

A bounded ReAct loop enables iterative self-correction, improving accuracy by up to 9.3 percentage points over single-pass inference on MultiWOZ. A symbolic validator enforces action compliance, schema conformance, and coreference consistency on every dialogue state update, achieving a 93.1% self-correction rate on intercepted errors and producing structured execution traces. Incremental state prediction and on-demand history retrieval keep prompts compact, empirically improving instruction adherence in parameter-constrained models.
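
The loop described above can be pictured with a minimal sketch. This is not the authors' implementation: the schema, the `ToolCall` type, the `update_state` tool name, the `validate` and `track_turn` functions, and the stub model are all hypothetical names chosen for illustration. The sketch covers only action and schema checks (coreference checks are omitted) and assumes the model proposes a state delta that is merged into the previous state once it passes validation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical toy schema: slot name -> allowed values (None means free-form text).
SCHEMA: Dict[str, Optional[Set[str]]] = {
    "hotel-area": {"north", "south", "east", "west", "centre"},
    "hotel-day": {"monday", "tuesday", "wednesday", "thursday",
                  "friday", "saturday", "sunday"},
    "hotel-name": None,
}


@dataclass
class ToolCall:
    name: str
    args: Dict[str, str]


def validate(call: ToolCall) -> List[str]:
    """Deterministic checks; an empty list means the call is accepted."""
    if call.name != "update_state":  # action compliance (assumed tool name)
        return [f"unknown tool '{call.name}'"]
    errors = []
    for slot, value in call.args.items():  # schema conformance
        if slot not in SCHEMA:
            errors.append(f"unknown slot '{slot}'")
        elif SCHEMA[slot] is not None and value not in SCHEMA[slot]:
            errors.append(f"value '{value}' not allowed for '{slot}'")
    return errors


Proposer = Callable[[str, Dict[str, str], List[str]], ToolCall]


def track_turn(state: Dict[str, str], utterance: str, propose: Proposer,
               max_steps: int = 3) -> Tuple[Dict[str, str], List[dict]]:
    """Run at most max_steps propose/validate rounds for one user turn."""
    feedback: List[str] = []
    trace: List[dict] = []
    for step in range(max_steps):
        call = propose(utterance, state, feedback)
        errors = validate(call)
        trace.append({"step": step, "call": call, "errors": errors})
        if not errors:
            # Incremental update: only the validated delta is merged.
            return {**state, **call.args}, trace
        feedback = errors  # validator messages go back to the model
    return state, trace  # budget exhausted: keep the previous state


def stub_model(utterance: str, state: Dict[str, str],
               feedback: List[str]) -> ToolCall:
    """Stand-in for an LLM: wrong on the first try, corrected after feedback."""
    if not feedback:
        return ToolCall("update_state", {"hotel-day": "tuesdy"})
    return ToolCall("update_state", {"hotel-day": "tuesday"})


new_state, trace = track_turn({"hotel-area": "north"},
                              "Book it for Tuesday.", stub_model)
print(new_state)   # {'hotel-area': 'north', 'hotel-day': 'tuesday'}
print(len(trace))  # 2
```

The fixed `max_steps` budget is what makes the loop bounded in this sketch: a turn either passes validation within the budget or leaves the previous state untouched, and the `trace` list records each attempt together with the validator's messages.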

On MultiWOZ 2.1, ReacTOD achieves a new zero-shot state-of-the-art: gpt-oss-20B reaches 52.71% joint goal accuracy (JGA), surpassing the previous best by 14 percentage points, while Qwen3-8B achieves 47.34% with only 8B parameters. On the Schema-Guided Dialogue (SGD) benchmark, ReacTOD with Claude-Opus-4.6 achieves 80.68% JGA under fully end-to-end evaluation with predicted domains, and Qwen3-32B reaches 64.09%, demonstrating cross-benchmark generalization without task-specific training data.
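
For readers unfamiliar with the metric, joint goal accuracy as commonly defined counts a turn as correct only when the entire predicted dialogue state matches the gold state. The sketch below illustrates that definition on made-up data. The function name and the example states are hypothetical, and any value normalization used in the paper's actual evaluation is not reproduced here.

```python
from typing import Dict, List

State = Dict[str, str]


def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Fraction of turns whose full predicted state equals the gold state."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of turns")
    if not gold:
        return 0.0
    # A turn is correct only if every slot-value pair matches exactly.
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Made-up four-turn example: the third turn has one wrong slot value.
gold_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-day": "tuesday"},
    {"hotel-area": "north", "hotel-day": "tuesday", "hotel-stay": "2"},
    {"hotel-area": "north", "hotel-day": "tuesday", "hotel-stay": "3"},
]
predicted_states = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-day": "tuesday"},
    {"hotel-area": "north", "hotel-day": "monday", "hotel-stay": "2"},
    {"hotel-area": "north", "hotel-day": "tuesday", "hotel-stay": "3"},
]

jga = joint_goal_accuracy(predicted_states, gold_states)
print(jga)  # 0.75
```

Because a single wrong slot makes the whole turn count as incorrect, JGA is much stricter than per-slot accuracy.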
