In Harvard study, AI offered more accurate emergency room diagnoses than two human doctors

A new study examines how large language models perform in a variety of medical contexts, including real emergency room cases — where at least one model seemed to be more accurate than human doctors. The study was published this week in Science and comes from a research team led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center.

The researchers said they conducted a variety of experiments to measure how OpenAI’s models compared to human physicians. In one experiment, researchers focused on 76 patients who came into the Beth Israel emergency room, comparing the diagnoses offered by two internal medicine attending physicians to those generated by OpenAI’s o1 and 4o models. These diagnoses were assessed by two other attending physicians, who did not know which ones came from humans and which came from AI.

“At each diagnostic touchpoint, o1 either performed nominally better than or on par with the two attending physicians and 4o,” the study said, adding that the differences “were especially pronounced at the first diagnostic touchpoint (initial ER triage), where there is the least information available about the patient and the most urgency to make the correct decision.”

In Harvard Medical School’s press release about the study, the researchers emphasized that they did not “pre-process the data at all” — the AI models were presented with the same information that was available in the electronic medical records at the time of each diagnosis. With that information, the o1 model managed to offer “the exact or very close diagnosis” in 67% of triage cases, compared to one physician who had the exact or close diagnosis 55% of the time, and to the other who hit the mark 50% of the time.

“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” said Arjun Manrai, who heads an AI lab at Harvard Medical School and is one of the study’s lead authors, in the press release.

To be clear, the study didn’t claim that AI is ready to make real life-or-death decisions in the emergency room. Instead, it said the findings show an “urgent need for prospective trials to evaluate these technologies in real-world patient care settings.” The researchers also noted that they only studied how models performed when provided with text-based information, and that “existing studies suggest that current foundation models are more limited in reasoning over nontext inputs.”

Adam Rodman, a Beth Israel doctor who’s also one of the study’s lead authors, warned the Guardian that there’s “no formal framework right now for accountability” around AI diagnoses, and that patients still “want humans to guide them through life or death decisions [and] to guide them through challenging treatment decisions.”

In a post about the study, Kristen Panthagani, an emergency physician, said this is “an interesting AI study that has led to some very overhyped headlines,” especially since it was comparing AI diagnoses to those from internal medicine physicians, not ER physicians. “If we’re going to compare AI tools to physicians’ clinical ability, we should start by comparing to physicians who actually practice that specialty,” Panthagani said. “I would not be surprised if a LLM could beat a dermatologist at a neurosurgery board exam, [but] that’s not a particularly helpful thing to know.”

She also argued, “As an ER doctor seeing a patient for a first time, my primary goal is not to guess your ultimate diagnosis. My primary goal is to determine if you have a condition that could kill you.”