A case study of evaluating AI agents on a neuroscience data-to-discovery pipeline

案例研究：评估神经科学“数据到发现”流水线中的 AI 智能体

Agentic AI tools offer a promising path to automating software development bottlenecks in scientific research pipelines, particularly for stages that take domain experts days to months to build, where scientists care about correctness and robustness, not implementation details.

智能体 AI 工具为自动化科学研究流水线中的软件开发瓶颈提供了一条有前景的路径，特别是在那些需要领域专家耗时数天至数月才能构建的阶段；在这些阶段中，科学家更关注结果的正确性和稳健性，而非具体的实现细节。

We present an empirical study of general-purpose coding agents on a fly optogenetics data-to-discovery pipeline. We assess agents on tasks substantially larger than existing benchmarks, datasets orders of magnitude bigger, and evaluation criteria grounded in domain expert standards.

我们针对果蝇光遗传学“数据到发现”流水线，对通用编程智能体进行了一项实证研究。我们评估的任务规模远超现有基准测试，数据集规模大出几个数量级，且评估标准基于领域专家的专业准则。

We show that agents can solve several individual pipeline stages, suggesting stage-level automation is tractable. By analyzing agents’ code iterations, we show that they struggle most when there is not a pre-defined criterion to iterate on, and they must instead use their scientific judgment to assess their current solution, a key open challenge.

研究表明，智能体能够解决流水线中的多个独立阶段，这说明阶段级的自动化是可行的。通过分析智能体的代码迭代过程，我们发现它们在缺乏预定义迭代标准时表现最为吃力，此时它们必须依靠科学判断来评估当前的解决方案，而这正是目前面临的一个关键挑战。

Mirroring scientific practice, they sometimes attempt visual inspection of intermediate outputs for self-evaluation, but largely fail to interpret what they see or act on it appropriately. Solving the end-to-end pipeline correctly requires stringing together successes across all pipeline stages, and this is beyond agents’ current abilities.

模仿科学实践，智能体有时会尝试通过目视检查中间输出来进行自我评估，但在解读所见内容或据此采取适当行动方面，它们大多以失败告终。要正确解决端到端的流水线问题，需要将所有阶段的成功串联起来，而这超出了智能体目前的能力范围。

We identify challenges largely absent from existing benchmarks, including computational resource management and generalization to large held-out data collections. Finally, we distill principles for constructing scientific tasks and rigorous evaluation criteria for open-ended problems.

我们识别出了一些现有基准测试中基本缺失的挑战，包括计算资源管理以及对大规模留出数据集的泛化能力。最后，我们总结了构建科学任务的原则，以及针对开放式问题的严谨评估标准。