Residual Skill Optimization for Text-to-SQL Ensembles

Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures.
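
To make the Pass@K bound concrete, the following minimal sketch computes empirical Pass@K from a per-question correctness matrix. The function name `pass_at_k` and both toy matrices are invented for illustration and are not taken from the paper's evaluation. In both matrices every candidate slot is correct on half of the questions, yet the correlated set covers only half of the questions while the complementary set covers all of them.

```python
from typing import Sequence


def pass_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions where at least one of the K candidates is correct.

    correct[i][j] is True when candidate j answers question i correctly.
    """
    if not correct:
        raise ValueError("need at least one question")
    return sum(any(row) for row in correct) / len(correct)


# Toy data (hypothetical): 4 questions, K = 3 candidates per question.
# Correlated failures: all candidates succeed or fail on the same questions.
correlated = [
    [True, True, True],
    [True, True, True],
    [False, False, False],
    [False, False, False],
]

# Complementary candidates: same per-candidate accuracy (2 of 4 questions each),
# but the failures fall on different questions.
complementary = [
    [True, False, True],
    [True, False, False],
    [False, True, True],
    [False, True, False],
]

print(pass_at_k(correlated))     # 0.5
print(pass_at_k(complementary))  # 1.0
```

Any selection step can at best pick a correct candidate when one exists, so the first set caps selected accuracy at 0.5 regardless of how good the selector is.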

We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K.
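
The link between residual examples and marginal Pass@K can be sketched as follows. This is a toy model under stated assumptions, not the DivSkill-SQL implementation: each skill is represented only by the set of example ids it solves, and the helper names (`residual`, `coverage`, `marginal_gain`) and all data are hypothetical. On a fixed example set, the gain in coverage from adding a skill equals the fraction of examples it solves among those the current ensemble fails on, so examples the ensemble already solves contribute nothing.

```python
from typing import List, Set


def residual(all_ids: Set[int], solved_by: List[Set[int]]) -> Set[int]:
    """Examples that no skill in the current ensemble solves."""
    covered: Set[int] = set().union(*solved_by)
    return all_ids - covered


def coverage(all_ids: Set[int], solved_by: List[Set[int]]) -> float:
    """Empirical Pass@K of the ensemble: fraction solved by at least one skill."""
    covered: Set[int] = set().union(*solved_by) & all_ids
    return len(covered) / len(all_ids)


def marginal_gain(all_ids: Set[int], solved_by: List[Set[int]], new: Set[int]) -> float:
    """Coverage added by a new skill: only its wins on the residual count."""
    return len(new & residual(all_ids, solved_by)) / len(all_ids)


# Toy data (hypothetical): 10 examples, one existing skill, two candidate skills.
examples = set(range(10))
skill_a = {0, 1, 2, 3, 4, 5}
skill_b = {0, 1, 2, 3, 4, 6}  # strong alone (6/10) but overlaps heavily with skill_a
skill_c = {6, 7, 8}           # weak alone (3/10) but aimed at skill_a's failures

print(sorted(residual(examples, [skill_a])))  # [6, 7, 8, 9]
print(marginal_gain(examples, [skill_a], skill_b))  # 0.1
print(marginal_gain(examples, [skill_a], skill_c))  # 0.3
```

In this toy setting the weaker standalone skill is the better addition, which is the intuition behind optimizing each new skill on the failures of the current ensemble rather than on the full training set.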

On Spider2-Lite, DivSkill-SQL improves selected accuracy over the strongest ensemble baseline by up to +11.1 points on Snowflake and +8.3 points on BigQuery, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining to other dialects (Snowflake, BigQuery, SQLite) and to different task formulations such as BIRD-Critic (+2.6 points).

Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.
