Can AI Agents Synthesize Scientific Conclusions?

Abstract: Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet their ability to do so in high-stakes domains such as health remains unclear.

We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews, designed to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall.
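To make the metric concrete, here is a minimal sketch of precision, recall, and F1 computed over atomic facts. It is not the SciConBench pipeline: the function names, the `toy_judge` matcher (exact match after lowercasing and whitespace normalization), and the example facts are all hypothetical, and a real pipeline would presumably use a much stronger judge to decide whether a fact is supported. The sketch also assumes that precision reflects correctness and recall reflects comprehensiveness.

```python
from typing import Callable, Dict, Sequence

# Hypothetical judge signature: does `fact` appear in (or follow from) `pool`?
Judge = Callable[[str, Sequence[str]], bool]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def toy_judge(fact: str, pool: Sequence[str]) -> bool:
    """Stand-in matcher (an assumption): exact match after normalization."""
    return normalize(fact) in {normalize(p) for p in pool}


def factual_scores(
    predicted: Sequence[str],
    reference: Sequence[str],
    judge: Judge = toy_judge,
) -> Dict[str, float]:
    """Score an agent's atomic facts against the reference conclusion's facts."""
    # Precision: share of the agent's facts that the reference supports.
    supported = sum(judge(fact, reference) for fact in predicted)
    precision = supported / len(predicted) if predicted else 0.0

    # Recall: share of the reference facts that the agent's answer covers.
    covered = sum(judge(fact, predicted) for fact in reference)
    recall = covered / len(reference) if reference else 0.0

    # F1: harmonic mean of precision and recall (0 when both are 0).
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up facts for illustration only; not taken from the benchmark.
    agent_facts = ["Drug A reduces mortality", "Drug A causes nausea"]
    review_facts = [
        "drug a reduces mortality",
        "Evidence certainty is low",
        "No effect on length of hospital stay",
        "Adverse events were rarely reported",
    ]
    print(factual_scores(agent_facts, review_facts))
```

In this toy case one of the agent's two facts is supported (precision 0.5) and one of the four reference facts is covered (recall 0.25), which illustrates how an answer can be partly correct yet far from comprehensive.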

To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves a factual F1 of only 0.337.

Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models’ true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find that they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available.

Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.