Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

Abstract: Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks, such as mathematics and coding, where correctness is verifiable. Yet many real-world reasoning problems are inductive: agents must infer uncertain beliefs from sparse, ambiguous observations. Standard fine-tuning methods are hard to apply to inductive reasoning for two reasons: large-scale, high-quality labeled datasets are difficult to curate, and the targets are inherently distributional rather than single correct answers.

In this work, we introduce a novel approach, called Program-based Posterior Training (PPT), to address these limitations: we use an LLM to generate diverse open-world scenarios as probabilistic programs, run probabilistic inference to produce distributional target responses to queries, and then fine-tune on these probabilistic soft labels.
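
The three steps above (scenario as a program, inference to get a distributional target, fine-tuning on soft labels) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy rain/sprinkler scenario, the likelihood table, the function names, and the use of exact enumeration are all assumptions made for illustration, and in PPT the scenarios would be generated by an LLM rather than written by hand.

```python
import math
from itertools import product


def posterior_by_enumeration(prior_rain, prior_sprinkler, observed_wet=True):
    """Exact posterior over 'did it rain?' given whether the grass is wet.

    Stands in for 'run probabilistic inference on a generated program'.
    The scenario and all numbers are hypothetical.
    """
    # P(wet | rain, sprinkler): illustrative likelihood table.
    p_wet = {
        (True, True): 0.99,
        (True, False): 0.90,
        (False, True): 0.80,
        (False, False): 0.05,
    }
    joint = {}
    for rain, sprinkler in product([True, False], repeat=2):
        prior = (prior_rain if rain else 1 - prior_rain) * (
            prior_sprinkler if sprinkler else 1 - prior_sprinkler
        )
        lik = p_wet[(rain, sprinkler)]
        if not observed_wet:
            lik = 1 - lik
        joint[(rain, sprinkler)] = prior * lik

    evidence = sum(joint.values())
    p_rain = sum(v for (rain, _), v in joint.items() if rain) / evidence
    # This distribution is the soft label for the query "did it rain?".
    return {"yes": p_rain, "no": 1 - p_rain}


def soft_label_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and a model's answer distribution."""
    return -sum(target[k] * math.log(max(predicted[k], eps)) for k in target)


if __name__ == "__main__":
    target = posterior_by_enumeration(prior_rain=0.2, prior_sprinkler=0.3)
    overconfident = {"yes": 0.9, "no": 0.1}
    print(target)
    print(soft_label_cross_entropy(target, overconfident))
```

The loss is smallest when the predicted distribution matches the posterior, so training against it rewards reporting the right degree of uncertainty rather than committing to a single hard answer.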

Using this approach, we fine-tune LLMs on 10,000 programmatically generated scenarios and evaluate on held-out motifs, human-labeled judgments, and external benchmarks. Overall, PPT substantially improves estimation accuracy on held-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration.

Additionally, the gains in raw calibration are not subsumed by post-hoc temperature scaling, indicating that the fine-tuned models have internalized uncertainty more deeply than output rescaling alone can achieve. Together, these results suggest that probabilistic-program-mediated fine-tuning is a promising approach for post-training LLMs to reliably perform approximate inductive inference.
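
For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of post-hoc temperature scaling for binary probability estimates. The evaluation protocol is not specified in this abstract, so the binary setting, the grid search over temperatures, and every function name here are assumptions for illustration only.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def scale(p, temperature):
    """Divide the log-odds by a temperature and map back to a probability."""
    return 1 / (1 + math.exp(-logit(p) / temperature))


def nll(probs, labels, temperature):
    """Mean negative log-likelihood of binary labels after scaling."""
    total = 0.0
    for p, y in zip(probs, labels):
        q = scale(p, temperature)
        total -= math.log(q if y == 1 else 1 - q)
    return total / len(probs)


def fit_temperature(probs, labels, grid=None):
    """Pick the temperature that minimizes NLL on held-out data (grid search)."""
    if grid is None:
        grid = [0.25 + 0.05 * i for i in range(96)]  # 0.25 up to 5.0
    return min(grid, key=lambda t: nll(probs, labels, t))


if __name__ == "__main__":
    # Hypothetical overconfident predictions: 99% confident, 75% correct.
    probs = [0.99, 0.99, 0.99, 0.99, 0.01, 0.01, 0.01, 0.01]
    labels = [1, 1, 1, 0, 0, 0, 0, 1]
    print(fit_temperature(probs, labels))
```

Because a single scalar rescales every prediction in the same way, temperature scaling cannot change which answers the model ranks as more likely. That is why calibration gains that persist after this step point to a change in the model's underlying estimates.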
