OpenAI's o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors

The study found the AI advantage was particularly pronounced in triage situations requiring fast decisions with minimal information.

AI outperforms doctors in Harvard trial of emergency triage diagnoses. Researchers say results mark a ‘profound change in technology that will reshape medicine’.

From George Clooney in ER to Noah Wyle in The Pitt, emergency department doctors have long been popular heroes. But will it soon be time to hang up the scrubs?

A groundbreaking Harvard study has found that AI systems outperformed human doctors in high-pressure emergency medicine triage, diagnosing more accurately in the potentially life-and-death moments when people are first rushed to hospital.

The results were described by independent experts as showing “a genuine step forward” in the clinical reasoning of AIs and came as part of trials that tested the responses of hundreds of doctors against an AI.

The authors said the results, published in the journal Science, showed large language models (LLMs) “have eclipsed most benchmarks of clinical reasoning”.

One experiment focused on 76 patients who arrived at the emergency room of a Boston hospital. An AI and a pair of human doctors were each given the same standard electronic health record to read – typically including vital sign data, demographic information and a few sentences from a nurse about why the patient was there. The AI identified the exact or very close diagnosis in 67% of cases, beating the human doctors, who were right only 50%-55% of the time.

It showed the AIs’ advantage was particularly pronounced in triage circumstances requiring rapid decisions with minimal information. The diagnostic accuracy of the AI – OpenAI’s o1 reasoning model – rose to 82% when more detail was available, compared with the 70-79% accuracy achieved by the expert humans, though this difference was not statistically significant.
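
As a rough illustration of why a gap of that size can fall short of statistical significance in a sample of 76 patients, the sketch below computes approximate 95% Wilson score intervals for each reported accuracy. The counts are back-calculated from the rounded percentages above and are assumptions, as is the choice of method and the assumption that all 76 cases were scored in this comparison; the study's own analysis is not described here and may well differ.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


N_PATIENTS = 76  # sample size reported in the article

# ASSUMPTION: counts reconstructed by rounding (reported percentage x 76);
# the article gives only percentages, not raw counts.
ASSUMED_CORRECT = {
    "AI (82%)": round(0.82 * N_PATIENTS),
    "Doctors, upper figure (79%)": round(0.79 * N_PATIENTS),
    "Doctors, lower figure (70%)": round(0.70 * N_PATIENTS),
}

if __name__ == "__main__":
    ai_interval = wilson_interval(ASSUMED_CORRECT["AI (82%)"], N_PATIENTS)
    for label, correct in ASSUMED_CORRECT.items():
        low, high = wilson_interval(correct, N_PATIENTS)
        overlaps = intervals_overlap(ai_interval, (low, high))
        print(f"{label}: {correct}/{N_PATIENTS} correct, "
              f"95% interval {low:.1%} to {high:.1%}, overlaps AI interval: {overlaps}")
```

With only 76 cases, each interval spans well over ten percentage points, so the AI's range overlaps both of the doctors' ranges. Overlapping intervals are only suggestive and are not a formal test, particularly when the same patients are assessed by every reader.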

It also outperformed a larger cohort of human doctors when asked to provide longer-term treatment plans, such as prescribing antibiotic regimens or planning end-of-life care. The AI and 46 doctors were asked to examine five clinical case studies and the computer made significantly better plans, scoring 89% compared with 34% for humans using conventional resources, such as search engines.

But it is not curtains for emergency doctors yet, the researchers said. The study only tested humans against AIs looking at patient data that can be communicated via text. The AI’s reading of signals, such as the patient’s level of distress and their visual appearance, was not tested. That means the AI was performing more like a clinician producing a second opinion based on paperwork.

“I don’t think our findings mean that AI replaces doctors,” said Arjun Manrai, one of the lead authors of the study who heads an AI lab at Harvard Medical School. “I think it does mean that we’re witnessing a really profound change in technology that will reshape medicine.”

Dr Adam Rodman, another lead author and a doctor at Boston’s Beth Israel Deaconess medical centre where the study took place, said AI LLMs were among “the most impactful technologies in decades”. Over the next decade, he said, AI would not replace physicians but join them in a new “triadic care model … the doctor, the patient, and an artificial intelligence system”.

In one case in the Harvard study, a patient presented with a blood clot to the lungs and worsening symptoms. Human doctors thought the anti-coagulants were failing, but the AI noticed something the humans did not: the patient’s history of lupus meant the disease might be causing the inflammation of the lungs. The AI was proved correct.

Nearly one in five US physicians are already using AI to assist diagnosis, according to research published last month. In the UK, 16% of doctors are using the tech daily and a further 15% weekly, with “clinical decision-making” being one of the most common uses, according to a recent Royal College of Physicians survey.

The UK doctors’ biggest concerns were AI error and liability risks. Billions are being invested in AI healthcare companies, but questions remain about the consequences of AI error.

“There is not a formal framework right now for accountability,” said Rodman, who also stressed patients ultimately “want humans to guide them through life or death decisions [and] to guide them through challenging treatment decisions”.

Prof Ewen Harrison, co-director of the University of Edinburgh’s centre for medical informatics, said the study was important and showed that “these systems are no longer just passing medical exams or solving artificial test cases. They are starting to look like useful second-opinion tools for clinicians, particularly when it is important to consider a wider range of possible diagnoses and avoid missing something important.”

Dr Wei Xing, an assistant professor at the University of Sheffield’s school of mathematical and physical sciences, said some of the other findings suggested doctors may unconsciously defer to the AI’s answer rather than thinking independently.

“This tendency could grow more significant as AI becomes more routinely used in clinical settings,” he said. He also highlighted the lack of information about which patients the AI was worse at diagnosing and whether it struggled more with elderly patients or non-English speakers.

He said: “It does not demonstrate that AI is safe for routine clinical use, nor that the public should turn to freely available AI tools as a substitute for medical advice.”