Prompting language influences diagnostic reasoning and accuracy of large language models

Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain.

Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B).

A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality.

Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity.

o3 was the only model showing no overall language effect.

These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable deployment across languages and cultures worldwide.