Your doctor’s AI notetaker may be making things up, Ontario audit finds


In recent years, many overworked doctors have turned to so-called AI medical scribes to help automatically summarize patient conversations, diagnoses, and care decisions into structured notes for health record logging.

But a recent audit by the auditor general of Ontario found that AI scribes recommended by the provincial government regularly generated incorrect, incomplete, and hallucinated information that could “potentially result in inadequate or harmful treatment plans that may potentially impact patient health outcomes.”

In a recent report on Use of Artificial Intelligence in the Ontario Government, the auditor general reviewed transcription tests of two simulated patient-doctor conversations performed across 20 AI scribe vendors that were approved and pre-qualified by the provincial government for purchase by healthcare providers.

All 20 of those vendors showed some issue with accuracy or completeness in at least one of these simple tests, including nine that hallucinated patient information, 12 that recorded information incorrectly, and 17 that missed key details about discussed mental health issues.

In the report, the auditor general points out multiple concerning examples of mistakes in those summaries that could have a direct and negative impact on a patient’s subsequent care. That includes situations where an AI scribe hallucinated nonexistent referrals for blood tests or therapy, incorrectly transcribed the names of prescription medication, and/or missed “key details” of mental health issues discussed in the simulated conversations.

Across all approved vendors, the average tested AI scribe scored only a 12 out of 20 on the “accuracy of medical notes generated” section of Supply Ontario’s evaluation rubric. But that seemingly key “accuracy” metric was only responsible for about 4 percent of a vendor’s overall score, making it easy to meet the minimum threshold for approval even if an AI scribe scored a “zero” on the accuracy metric (a separate metric measuring “domestic presence in Ontario” was worth 30 percent of the overall scoring).

All these factors contributed to the auditor general’s overall finding that these AI scribes “were not evaluated adequately.” In a display of restraint and understatement, the report notes that “it is important that AI scribe systems are tested to provide assurances as to the quality of their generated notes and to minimize inaccuracies.” It also recommends that IT departments using these scribes force doctors to “confirm their review of the notes produced” before committing them to patient logs.

Public sector health services in Ontario are not required to use these AI scribe systems in their work and may purchase scribes from non-approved vendors if they wish. Still, the fact that the Ontario government recommended AI summary systems with such obvious and potentially patient-harming flaws should give pause to any doctors (or their patients) making use of them.