Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model
Abstract:
Objectives: Automatic data extraction from free-text radiology reports enables large-scale research, but few studies have assessed the performance of large language models (LLMs) on Dutch neuroradiology reports.
Methods: We analyzed 947 brain MRI reports from a tertiary memory clinic (2016-2021), authored by consultant neuroradiologists. Trained medical students annotated 30 variables; 100 reports were double-annotated to assess inter-rater reliability. We evaluated the open-weight LLM LLaMA 3.1 on the original Dutch reports and on their English translations, and with few-shot prompting using different example selection strategies. Performance was evaluated using balanced accuracy for categorical variables, accuracy and mean absolute error for counts, and text similarity for free-text variables. Metrics were computed across 10 random splits of the 947 reports.
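To make the evaluation metrics concrete, the following is a minimal sketch of balanced accuracy, plain accuracy, and mean absolute error. It is not the study's evaluation code: the function names and the toy labels are hypothetical, and the text similarity measure is omitted because the abstract does not specify it.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def accuracy(y_true, y_pred):
    """Fraction of exact matches (used here for count variables)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between annotated and extracted counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data, not taken from the study.
fazekas_true = [0, 0, 0, 1, 2, 3]
fazekas_pred = [0, 0, 0, 1, 1, 3]
microbleeds_true = [0, 2, 5, 1]
microbleeds_pred = [0, 2, 3, 1]

fazekas_bal_acc = balanced_accuracy(fazekas_true, fazekas_pred)
microbleeds_acc = accuracy(microbleeds_true, microbleeds_pred)
microbleeds_mae = mean_absolute_error(microbleeds_true, microbleeds_pred)
```

In the toy Fazekas example, plain accuracy is 5/6 while balanced accuracy is lower, because the single grade-2 report is missed and each grade contributes equally to the mean recall.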
Results: LLaMA 3.1 demonstrated high zero-shot performance for visual rating scores (mean [95% CI]): Medial Temporal Atrophy: 90% [77-100%] on the left and 96% [94-99%] on the right, Global Cortical Atrophy: 87% [83-91%], and Fazekas: 94% [93-96%]. Microbleed mentions were detected with 93% accuracy [92-95%] and infarct mentions with 82% [80-84%]. Text similarity for lesion location reached 0.95 [0.95-0.96]. Performance was lower for numerical variables: 80% [78-82%] for the number of microbleeds and 66% [63-68%] for the number of infarcts. English translation yielded comparable results. Few-shot prompting improved performance for numerical variables, achieving 92% [90-93%] for microbleeds and 81% [77-85%] for infarcts using structural similarity-based selection.
Conclusion: LLaMA 3.1 shows strong potential for extracting data from Dutch neuroradiology reports. Few-shot prompting enhances performance for numerical variables, whereas challenges remain for location-specific variables.
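As an illustration of few-shot prompting with similarity-based example selection, the sketch below ranks annotated reports by their similarity to the target report and places the top matches in the prompt. It rests on several assumptions: the abstract does not define "structural similarity", so character-level similarity from Python's `difflib.SequenceMatcher` is swapped in as a stand-in, and the prompt wording, the helper names `select_examples` and `build_prompt`, and the English toy reports are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; NOT the study's structural measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_examples(target_report, annotated_pool, k=2):
    """Return the k annotated reports most similar to the target report."""
    ranked = sorted(
        annotated_pool,
        key=lambda ex: similarity(target_report, ex["report"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(target_report, examples):
    """Assemble a few-shot prompt: instruction, worked examples, then the target."""
    parts = ["Extract the number of microbleeds from the report."]
    for ex in examples:
        parts.append(f"Report: {ex['report']}\nAnswer: {ex['answer']}")
    parts.append(f"Report: {target_report}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical annotated pool; real reports in the study are in Dutch.
pool = [
    {"report": "Three microbleeds in the left frontal lobe.", "answer": 3},
    {"report": "No microbleeds. Fazekas 1.", "answer": 0},
    {"report": "Old lacunar infarct in the right thalamus.", "answer": 0},
]
target = "Two microbleeds in the right frontal lobe."

chosen = select_examples(target, pool, k=1)
prompt = build_prompt(target, chosen)
```

The resulting `prompt` string would then be passed to the model; the examples must come from reports outside the evaluation set for that split, so that few-shot results are not inflated by leakage.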