Can AI Agents Synthesize Scientific Conclusions?
Abstract: Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet their ability to do so in high-stakes domains such as health remains unclear.
We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews, designed to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall.
To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves a factual F1 of only 0.337.
Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models’ true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find that they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available.
Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.
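To make the factual precision, recall, and F1 mentioned in the abstract concrete, here is a minimal sketch of how such scores could be computed once conclusions have been decomposed into atomic facts. The abstract does not give the exact definitions, so this assumes the conventional ones: precision is the share of generated atomic facts supported by the expert conclusion, recall is the share of expert atomic facts covered by the generated conclusion, and F1 is their harmonic mean. The function name `factual_scores`, the `FactualScores` type, and the boolean judgment lists are hypothetical and are not part of SciConBench; the judgments themselves would come from whatever matching step the pipeline uses.

```python
from dataclasses import dataclass


@dataclass
class FactualScores:
    precision: float
    recall: float
    f1: float


def factual_scores(generated_supported: list[bool],
                   reference_covered: list[bool]) -> FactualScores:
    """Compute factual precision, recall, and F1 from per-fact judgments.

    generated_supported[i]: True if the i-th atomic fact of the generated
        conclusion is supported by the expert-written conclusion (assumed input).
    reference_covered[j]: True if the j-th atomic fact of the expert-written
        conclusion is covered by the generated conclusion (assumed input).
    """
    # Precision: fraction of generated facts that are supported.
    precision = (sum(generated_supported) / len(generated_supported)
                 if generated_supported else 0.0)
    # Recall: fraction of reference facts that are covered.
    recall = (sum(reference_covered) / len(reference_covered)
              if reference_covered else 0.0)
    # F1: harmonic mean, defined as 0 when both components are 0.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return FactualScores(precision, recall, f1)


# Illustrative judgments only (not data from the paper):
# 2 of 4 generated facts are supported, 1 of 5 reference facts is covered.
scores = factual_scores([True, True, False, False],
                        [True, False, False, False, False])
print(scores)
```

In this made-up case, a conclusion that is half correct but covers only one fifth of the expert's facts scores a precision of 0.5 and a recall of 0.2, which illustrates how an answer can be penalized for being incomplete even when much of what it states is accurate.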