Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

Lean4Agent：智能体工作流与轨迹的形式化建模与验证

Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs’ agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflow and execution trajectories. 使大语言模型（LLMs）能够执行可靠的多步工作流已成为人工智能领域的核心挑战。尽管大模型在智能体能力方面取得了最新进展，但大多数智能体系统仍然缺乏用于规范、验证和调试其工作流及执行轨迹的形式化方法。

This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivates the development of formal languages (FLs). Inspired by this paradigm, we propose Lean4Agent, to the best of our knowledge, the first framework that uses Lean4, a dependent-type FL to model and verify agent behavior. 这一挑战反映了数学中一个长期存在的问题，即自然语言（NL）的歧义性推动了形式语言（FL）的发展。受此范式的启发，我们提出了 Lean4Agent。据我们所知，这是首个使用 Lean4（一种依赖类型形式语言）来建模和验证智能体行为的框架。

Lean4Agent launches FormalAgentLib, an extensible Lean4 library for formally modeling and verifying agent workflows’ semantic consistency under explicit assumptions, and enabling localization of execution-time failures revealed by trajectories. Building on FormalAgentLib, we further develop LeanEvolve, which applies results in FormalAgentLib to revise workflows to enhance its capability. Lean4Agent 推出了 FormalAgentLib，这是一个可扩展的 Lean4 库，用于在明确假设下对智能体工作流的语义一致性进行形式化建模与验证，并能够定位由轨迹所揭示的执行时故障。基于 FormalAgentLib，我们进一步开发了 LeanEvolve，它应用 FormalAgentLib 的结果来修订工作流，从而增强其能力。

Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of 11.94%, and LeanEvolve further improves SWE performance by 7.47% on average. Furthermore, Lean4Agent establishes a foundation for a new field of using expressive dependent-type FL to formally model and verify agent behavior. 在 5 个领先的大语言模型上，针对 SWE-Bench-Verified 的困难问题子集和 ELAIP-Bench 的子集进行的广泛实验表明，通过验证的工作流比未通过的工作流平均表现高出 11.94%，而 LeanEvolve 进一步将 SWE 性能平均提升了 7.47%。此外，Lean4Agent 为使用表达能力强的依赖类型形式语言来形式化建模和验证智能体行为这一新领域奠定了基础。