Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Pythagoras-Prover：通过增强型 Lean 形式化推进高效形式化证明

Abstract: Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verified proof data and the long reasoning traces of formal proof search, making both supervised fine-tuning (SFT) and sampling expensive.

摘要： 现代 Lean 定理证明器只有在投入大量的训练和推理计算资源后才能获得强大的性能。这部分归因于经过验证的证明数据稀缺，以及形式化证明搜索中冗长的推理轨迹，导致监督微调（SFT）和采样过程都非常昂贵。

We introduce Pythagoras-Prover, a compute-efficient open-source family of Lean theorem provers built for practical compute budgets. The family spans two generation paradigms: autoregressive models at 4B and 32B parameters, and a first proof-of-concept diffusion-based prover (4B) that iteratively refines Lean proofs at inference time.

我们推出了 Pythagoras-Prover，这是一个为实际计算预算而构建的计算高效型开源 Lean 定理证明器系列。该系列涵盖了两种生成范式：参数量分别为 4B 和 32B 的自回归模型，以及首个基于扩散模型的概念验证证明器（4B），它能够在推理阶段迭代优化 Lean 证明。

For training efficiency, we build a Lean-verified corpus stratified into easy, medium, and hard problems for curriculum SFT, so models acquire proof skills progressively from shorter, simpler proofs to longer, harder ones. During SFT, a dynamic proof-reasoning filtering scheme preserves informative proof traces while keeping each instance within an 8k-token context budget.

为了提高训练效率，我们构建了一个按简单、中等和困难问题分层的 Lean 验证语料库，用于课程化监督微调（Curriculum SFT），使模型能够从较短、较简单的证明逐步掌握到较长、较难的证明技能。在 SFT 过程中，一种动态的证明推理过滤方案在保留信息量大的证明轨迹的同时，将每个实例保持在 8k token 的上下文预算内。

We also introduce Augmented Lean Formalisation (ALF), which expands scarce verified corpora into variants of formal statements, populated via self-distillation for extra training signal without formally verifying every mutated instance. By perturbing known problems while preserving their formal character, ALF reduces reliance on any statement’s surface form.

我们还引入了增强型 Lean 形式化（Augmented Lean Formalisation, ALF），它通过自我蒸馏将稀缺的已验证语料库扩展为各种形式化语句变体，从而在无需对每个变异实例进行正式验证的情况下提供额外的训练信号。通过在保持问题形式特征的同时对已知问题进行扰动，ALF 降低了模型对语句表面形式的依赖。

Empirically, Pythagoras-Prover-4B surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test (86.1% vs 82.4%) with ~167x fewer parameters, while Pythagoras-Prover-32B sets the open-source state of the art at 93.0% on MiniF2F-Test and solves 93 of 672 PutnamBench problems. We release MiniF2F-ALF, an ALF-mutated contamination-sensitive benchmark on which every evaluated model loses accuracy; here our 32B remains strongest and our 4B matches the prior state of the art, Goedel-Prover-V2-32B.

实验结果表明，Pythagoras-Prover-4B 在 MiniF2F-Test 的 pass@32 指标上以约 167 倍更少的参数超越了 DeepSeek-Prover-V2-671B（86.1% 对 82.4%）；而 Pythagoras-Prover-32B 在 MiniF2F-Test 上以 93.0% 的准确率创下了开源领域的新纪录，并解决了 PutnamBench 中 672 道题目中的 93 道。我们还发布了 MiniF2F-ALF，这是一个经过 ALF 变异且对污染敏感的基准测试，所有参与评估的模型在该测试上的准确率均有所下降；其中我们的 32B 模型表现最强，而 4B 模型则与此前的最佳模型 Goedel-Prover-V2-32B 持平。