An AI agent for treatment reasoning over a biomedical tool universe

Treatment reasoning underpins every therapeutic decision, integrating disease context, comorbidities, medications, contraindications, and evolving biomedical knowledge to select an appropriate therapy. It is inherently iterative: candidates are weighed against many constraints, revised as evidence emerges, and grounded in verifiable sources.

Here we introduce ATHENA-R1, an AI agent for treatment reasoning across all drugs approved by the FDA since 1939, trained by reinforcement learning over a universe of 212 biomedical tools. At each step it identifies missing information, selects and runs relevant tools, and incorporates the resulting evidence into its reasoning.
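
The step-wise loop described above (identify missing information, select and run a tool, incorporate the evidence) can be illustrated with a minimal sketch. Everything named here is a hypothetical placeholder: the two toy tools, the `State` class, and the trivial "pick the first gap" selection rule are assumptions for illustration, not the ATHENA-R1 tool universe or its learned policy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical toy tools standing in for real biomedical tools.
def lookup_indication(drug: str) -> str:
    return f"{drug}: indication record (placeholder)"


def lookup_contraindication(drug: str) -> str:
    return f"{drug}: contraindication record (placeholder)"


TOOLS: Dict[str, Callable[[str], str]] = {
    "indication": lookup_indication,
    "contraindication": lookup_contraindication,
}


@dataclass
class State:
    drug: str
    needed: List[str]  # evidence types required before concluding
    evidence: Dict[str, str] = field(default_factory=dict)


def missing(state: State) -> List[str]:
    """Return the evidence types that have not been gathered yet."""
    return [kind for kind in state.needed if kind not in state.evidence]


def run_agent(state: State, max_steps: int = 10) -> State:
    """Iterate: find a gap, run a tool for it, store the result."""
    for _ in range(max_steps):
        gaps = missing(state)
        if not gaps:
            break  # enough evidence to conclude
        tool_name = gaps[0]  # assumed rule; a trained policy would choose here
        state.evidence[tool_name] = TOOLS[tool_name](state.drug)
    return state


final = run_agent(State("example_drug", ["indication", "contraindication"]))
```

In this sketch the loop stops as soon as no evidence gap remains; in a trained agent, both the choice of tool and the decision to stop would come from the model rather than from a fixed rule.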

To train it without human-annotated traces, we build a two-level self-learning framework: multi-agent systems construct the tools, tasks, and reasoning trajectories for supervised fine-tuning, and reinforcement learning with scientific feedback is then used to reward reasoning quality (evidence gathering, grounded tool use, logical non-redundancy).
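
One way to picture a reward over the three listed reasoning qualities is the sketch below. It is an assumption-laden illustration, not the reward used to train ATHENA-R1: the function name `reasoning_reward`, the way each component is scored, and the equal weights are all hypothetical.

```python
from typing import List, Set, Tuple


def reasoning_reward(
    tool_calls: List[Tuple[str, str]],  # (tool name, argument) per step
    required: Set[str],                 # tools whose evidence the task needs
    cited: List[str],                   # tools the final answer cites
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Hypothetical composite reward in [0, 1] over one trajectory."""
    called = {name for name, _ in tool_calls}

    # Evidence gathering: share of required evidence actually retrieved.
    evidence = len(required & called) / len(required) if required else 1.0

    # Grounded tool use: share of citations backed by a tool that was run.
    grounded = sum(c in called for c in cited) / len(cited) if cited else 0.0

    # Non-redundancy: share of tool calls that are not exact repeats.
    non_redundant = len(set(tool_calls)) / len(tool_calls) if tool_calls else 0.0

    w_e, w_g, w_n = weights
    return w_e * evidence + w_g * grounded + w_n * non_redundant


calls = [
    ("indication", "example_drug"),
    ("indication", "example_drug"),       # exact repeat, penalized
    ("contraindication", "example_drug"),
]
reward = reasoning_reward(
    calls,
    required={"indication", "contraindication"},
    cited=["indication", "dosage"],       # "dosage" was never retrieved
)
```

For this toy trajectory the three components are 1.0, 0.5, and 2/3, so the equally weighted reward is 13/18, about 0.72.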

Across five benchmarks of 3,168 drug reasoning tasks and 456 patient treatment cases, ATHENA-R1 outperforms language models and tool-use systems, reaching 94.7% accuracy on open-ended drug reasoning and 82.9% on treatment reasoning, which is 17.8 and 10.7 percentage points above GPT-5, respectively.

In blinded evaluations by experts from 28 rare disease organizations, it was preferred over reference models on all criteria, and physicians rated it favorably on complex inpatient cardiovascular and infectious-disease cases. Adverse-event hypotheses that it generated were tested in electronic health records from 5.4 million patients and yielded adjusted odds ratios of 1.48 to 1.84, with no elevation among negative controls.
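
For readers less familiar with odds ratios, the sketch below computes an unadjusted odds ratio with a Wald 95% confidence interval from a 2x2 table. It is a simplification: the ratios reported above are adjusted, which would normally require a regression model with covariates, and the counts used here are invented for illustration and are not taken from the study.

```python
import math
from typing import Tuple


def odds_ratio(a: int, b: int, c: int, d: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Unadjusted odds ratio and Wald confidence interval.

    a: exposed, event      b: exposed, no event
    c: unexposed, event    d: unexposed, no event
    All four cells must be positive.
    """
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper


# Illustrative counts only (not study data).
estimate, lower, upper = odds_ratio(30, 70, 20, 80)
```

With these illustrative counts the point estimate is 12/7, about 1.71. An odds ratio above 1 whose interval excludes 1 suggests an association between exposure and event, while negative controls are expected to sit near 1.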

Because it requires knowing what evidence to seek before concluding, treatment reasoning has long been hard for AI; we show it can be reframed as a learnable process of iterative evidence gathering that reinforcement learning can train AI to perform.
