Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established.

We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model’s internal representations to grow unboundedly, violating the very conditions under which these methods work.
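To make concrete what "causal graphs generating similar observational data" can look like, here is a minimal sketch using a textbook linear-Gaussian pair. It is not taken from the paper and says nothing about the kernel obstruction theorem itself. The function names and the parameter values (`b = 0.8`, `noise_var = 0.36`) are illustrative assumptions. The two graphs X -> Y and Y -> X are tuned to give the same covariance, so no amount of observational data separates them, yet they disagree about the effect of an intervention on X.

```python
import math

def cov_x_causes_y(b, noise_var):
    """Covariance (Var X, Cov XY, Var Y) for X ~ N(0, 1), Y = b*X + noise."""
    return (1.0, b, b * b + noise_var)

def cov_y_causes_x(var_y, c, noise_var):
    """Covariance (Var X, Cov XY, Var Y) for Y ~ N(0, var_y), X = c*Y + noise."""
    return (c * c * var_y + noise_var, c * var_y, var_y)

def reversed_parameters(b, noise_var):
    """Parameters of the Y -> X model that reproduces the X -> Y covariance."""
    var_y = b * b + noise_var
    return var_y, b / var_y, noise_var / var_y

# Illustrative values (assumptions, not from the source).
b, noise_var = 0.8, 0.36
forward = cov_x_causes_y(b, noise_var)
backward = cov_y_causes_x(*reversed_parameters(b, noise_var))

# Zero-mean Gaussians with equal covariance are the same distribution.
same_observations = all(math.isclose(f, g) for f, g in zip(forward, backward))

# Interventional means differ: E[Y | do(X = 1)] under each graph.
effect_if_x_causes_y = b * 1.0   # X -> Y: Y responds to the forced X
effect_if_y_causes_x = 0.0       # Y -> X: forcing X leaves Y untouched
```

Here `same_observations` is true while the two interventional effects differ, which is the kind of gap that only interventional queries can close.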

We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, not any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds.
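The division of labor described above (a frozen model that only answers interventional queries, and an external loop that holds the beliefs) can be sketched as follows. This is a toy illustration under stated assumptions, not the A-CBO algorithm from the paper. Candidate graphs are restricted to causal chains over a few variables. The oracle is a stub standing in for the language model. The query rule (pick the query that splits posterior mass most evenly) and the symmetric noise rate `eps` are choices made here for illustration. All function names are hypothetical.

```python
import itertools
import random

def predicts(order, query):
    """Chain hypothesis `order` says do(i) changes j iff i precedes j."""
    i, j = query
    return order.index(i) < order.index(j)

def make_stub_oracle(truth, flip_prob, rng):
    """Stand-in for the frozen LLM: answers from `truth`, wrong w.p. flip_prob."""
    def oracle(query):
        answer = predicts(truth, query)
        return (not answer) if rng.random() < flip_prob else answer
    return oracle

def pick_query(posterior, queries):
    """Choose the query whose predicted 'yes' mass is closest to 1/2."""
    def imbalance(q):
        yes = sum(p for h, p in posterior.items() if predicts(h, q))
        return abs(yes - 0.5)
    return min(queries, key=imbalance)

def update(posterior, query, answer, eps):
    """Bayes rule with symmetric noise: P(answer | h) = 1 - eps if h agrees."""
    weighted = {h: p * ((1 - eps) if predicts(h, query) == answer else eps)
                for h, p in posterior.items()}
    total = sum(weighted.values())
    return {h: w / total for h, w in weighted.items()}

def run_loop(n_vars, oracle, rounds, eps=0.1):
    """External belief loop; the oracle itself is never modified."""
    hypotheses = list(itertools.permutations(range(n_vars)))
    queries = list(itertools.permutations(range(n_vars), 2))
    posterior = {h: 1.0 / len(hypotheses) for h in hypotheses}
    for _ in range(rounds):
        q = pick_query(posterior, queries)
        posterior = update(posterior, q, oracle(q), eps)
    return posterior

if __name__ == "__main__":
    truth = (2, 0, 3, 1)  # hypothetical ground-truth chain 2 -> 0 -> 3 -> 1
    oracle = make_stub_oracle(truth, flip_prob=0.1, rng=random.Random(0))
    posterior = run_loop(4, oracle, rounds=30)
    best = max(posterior, key=posterior.get)
    print(best, round(posterior[best], 3))
```

The point of the design is that all learning happens in `posterior`, outside the oracle. How many rounds such a loop needs, and under what conditions, is what the paper's convergence result addresses, and this sketch makes no claim about it.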

Because the decision procedure operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark that scales to 24 variables and contains 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, and its advantage grows with scale.
