Test-Time Verification for Text-to-SQL via Outcome Reward Models

Abstract: Improving the reliability of large language models (LLMs) at inference time is a central challenge in structured reasoning tasks such as Text-to-SQL. Common test-time inference strategies, including Best-of-N sampling and Majority Voting, rely on heuristic signals such as execution success or output frequency, which provide limited semantic discrimination across candidate outputs.

In this work, we study Outcome Reward Models (ORMs) as learned semantic scoring functions for test-time verification in Text-to-SQL. While ORMs have been previously explored for test-time scaling and alignment, their application to structured query generation remains underexplored.

We introduce GradeSQL, a scalable framework for training task-specific ORMs via automated candidate generation and execution-based labeling, enabling verifier training without manual annotation. We integrate ORMs into a verification-driven Best-of-N pipeline and evaluate our approach on the BIRD and Spider benchmarks across multiple open-source LLM families.
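Execution-based labeling can be illustrated with a minimal sketch. It assumes a SQLite database and a reference (gold) query for each question; the helper names `run_query` and `label_candidates`, the toy `singer` table, and the order-insensitive comparison of result rows are assumptions made for illustration, not the GradeSQL implementation.

```python
import sqlite3

def run_query(conn, sql):
    """Execute sql; return its rows in a canonical order, or None on failure."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort so that row order does not affect the comparison (an assumption).
    return sorted(rows, key=repr)

def label_candidates(conn, gold_sql, candidates):
    """Label a candidate 1 if it executes and matches the gold result, else 0."""
    gold = run_query(conn, gold_sql)
    labeled = []
    for sql in candidates:
        result = run_query(conn, sql)
        labeled.append((sql, int(result is not None and result == gold)))
    return labeled

# Toy database, hypothetical and for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER);
INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 45), (3, 'Cy', 52);
""")

gold_sql = "SELECT name FROM singer WHERE age > 40"
candidates = [
    "SELECT name FROM singer WHERE age >= 41",              # same rows as gold
    "SELECT name FROM singer WHERE age > 30 AND age < 50",  # misses one row
    "SELECT nme FROM singer",                               # fails to execute
]
labels = label_candidates(conn, gold_sql, candidates)
```

Each (candidate, label) pair, together with the question, could then serve as a training example for a verifier without any manual annotation.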

ORM-based selection consistently outperforms execution-based Best-of-N and Majority Voting, with gains of up to +4.33% on BIRD and +2.10% on Spider. We further show that ORMs scale effectively with larger candidate sets and yield stronger improvements on complex queries.
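The difference between the selection strategies can be sketched as follows. This is a toy illustration under stated assumptions: `majority_vote` groups candidates by execution result, `orm_best_of_n` takes any callable `score_fn(question, sql)` that stands in for a trained ORM, and the scores in `toy_scores` are made up to show the mechanism. None of this is taken from the paper's code or results.

```python
import sqlite3
from collections import Counter

def execute(conn, sql):
    """Return a hashable, order-insensitive result, or None if execution fails."""
    try:
        return tuple(sorted(conn.execute(sql).fetchall(), key=repr))
    except sqlite3.Error:
        return None

def majority_vote(conn, candidates):
    """Pick a candidate whose execution result is the most frequent one."""
    valid = [(sql, res) for sql in candidates
             if (res := execute(conn, sql)) is not None]
    if not valid:
        return candidates[0]  # fallback when nothing executes (an assumption)
    top_result, _ = Counter(res for _, res in valid).most_common(1)[0]
    return next(sql for sql, res in valid if res == top_result)

def orm_best_of_n(question, candidates, score_fn):
    """Pick the candidate with the highest verifier score."""
    return max(candidates, key=lambda sql: score_fn(question, sql))

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER);
INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 45), (3, 'Cy', 52);
""")

question = "How many singers are older than 40?"
c1 = "SELECT COUNT(*) FROM singer"                  # ignores the age filter
c2 = "SELECT COUNT(id) FROM singer"                 # same mistake, same result
c3 = "SELECT COUNT(*) FROM singer WHERE age > 40"   # matches the question
candidates = [c1, c2, c3]

# Hypothetical scores standing in for a trained ORM's outputs.
toy_scores = {c1: 0.20, c2: 0.25, c3: 0.90}

voted = majority_vote(conn, candidates)
selected = orm_best_of_n(question, candidates, lambda q, sql: toy_scores[sql])
```

In this toy case the two incorrect candidates agree with each other, so frequency-based voting returns one of them, while a scorer that judges each candidate against the question can still pick the minority answer.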

Overall, our results demonstrate that ORM-based verification provides a simple, effective, and scalable alternative to heuristic test-time selection strategies for Text-to-SQL. Code, datasets, and models are publicly available.
