PrologMCP: A Standardized Prolog Tool Interface for LLM Agents

Abstract: Frontier reasoning-tuned language models still fail on deductive tasks as inference depth grows, and buying accuracy with longer internal reasoning scales poorly in cost. Symbolic delegation offers a complementary route: the language model translates the problem and a solver performs the inference. However, current autoformalization pipelines for logic programming are typically bespoke integrations tied to particular tasks or agents.

We introduce PrologMCP, a task-agnostic, open-source server that exposes Prolog as a stateful tool through the Model Context Protocol (MCP). Its compact tool interface, structured error reporting, and per-session isolation make the translate-run-inspect-repair loop a reusable primitive for MCP-capable agents.
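
The translate-run-inspect-repair loop can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `ToySession`, `stub_translate`, and `solve` are hypothetical names and do not reflect PrologMCP's actual tool interface, the toy forward-chainer over ground clauses stands in for a real Prolog engine, and the stub translator stands in for the language model.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ToySession:
    """Hypothetical stand-in for one isolated solver session (not the PrologMCP API)."""
    facts: set = field(default_factory=set)
    rules: list = field(default_factory=list)  # list of (head, [body atoms])

    def load(self, program: str) -> dict:
        """Parse ground clauses such as 'a :- b, c.' and return a structured result."""
        facts, rules = set(), []
        for lineno, line in enumerate(program.strip().splitlines(), start=1):
            line = line.strip()
            if not line:
                continue
            if not line.endswith("."):
                # Structured error: machine-readable fields the caller can inspect.
                return {"ok": False, "error": "syntax_error", "line": lineno,
                        "message": "clause must end with '.'"}
            head, _, body = line[:-1].partition(":-")
            atoms = [a.strip() for a in body.split(",") if a.strip()]
            if atoms:
                rules.append((head.strip(), atoms))
            else:
                facts.add(head.strip())
        # Only commit the program to the session if every clause parsed.
        self.facts, self.rules = facts, rules
        return {"ok": True}

    def query(self, goal: str) -> dict:
        """Forward-chain to a fixpoint, then check whether the goal was derived."""
        derived = set(self.facts)
        changed = True
        while changed:
            changed = False
            for head, body in self.rules:
                if head not in derived and all(a in derived for a in body):
                    derived.add(head)
                    changed = True
        return {"ok": True, "proved": goal in derived}


def stub_translate(problem: str, feedback: Optional[dict]) -> str:
    """Stand-in for the LLM: the first draft lacks a period, the repair adds it."""
    if feedback is None:
        return "big(bob).\nstrong(bob) :- big(bob)"
    return "big(bob).\nstrong(bob) :- big(bob)."


def solve(problem: str, goal: str,
          translate: Callable[[str, Optional[dict]], str],
          max_attempts: int = 3) -> dict:
    """Translate, run, inspect the structured result, and repair on failure."""
    session = ToySession()  # fresh state per problem, mimicking session isolation
    feedback = None
    for attempt in range(1, max_attempts + 1):
        program = translate(problem, feedback)
        result = session.load(program)
        if result["ok"]:
            answer = session.query(goal)
            return {"attempts": attempt, "proved": answer["proved"]}
        feedback = result  # hand the error back to the translator
    return {"attempts": max_attempts, "proved": None}


if __name__ == "__main__":
    print(solve("Bob is big. Big people are strong.", "strong(bob)", stub_translate))
```

In this sketch the first draft is rejected with a structured syntax error on line 2, the stub repairs it, and the second attempt proves the goal, so the call returns `{'attempts': 2, 'proved': True}`. The point of the design is that the model never performs the deduction itself; it only reacts to machine-readable feedback from the solver.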

We evaluate a formalizer agent enhanced with PrologMCP against standard and reasoning LLMs (Claude Sonnet 4.6, GPT-4.1, and o4-mini) on two subsets of PARARULE-Plus: a general-purpose sample and a more challenging one targeting a specific failure mode of natural-language reasoning. On the general sample, the formalizer matches or exceeds the reasoning LLMs (accuracy 1.00 vs. 1.00 / 0.998) and shows its largest gains over standard models (GPT-4.1 reaches 0.762).

On the challenging subset, the formalizer remains near-perfect (1.00 / 0.99) while reasoning LLMs drop to 0.95 / 0.94. These results suggest that delegating inference to Prolog via MCP is a robust and inspectable alternative to extended natural-language reasoning.
