Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts

Abstract: The rapid growth of scientific publications makes manual study screening in systematic literature reviews (SLRs) increasingly resource-intensive, inefficient, and inconsistent. Classifying studies that explicitly report health-related quality-of-life results, such as EQ-5D data, requires a high level of clinical interpretation and is challenging for human reviewers.

This study investigates the use of Google’s Gemini and Gemma large language models (LLMs) to automate the detection of EQ-5D studies in the PubMed biomedical database based only on published abstracts. It proposes a multi-phase framework that integrates few-shot prompting, weighted ensemble aggregation, and a soft stacking meta-classifier.
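
As an illustration of the weighted ensemble aggregation step, the following is a minimal Python sketch, not the study's implementation. It assumes a binary task (the abstract reports EQ-5D results or not) and that each model returns a probability for the positive class. The model names, weights, probabilities, and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, Tuple


def weighted_ensemble(
    probs: Dict[str, float],
    weights: Dict[str, float],
    threshold: float = 0.5,  # assumed decision threshold, not from the study
) -> Tuple[float, int]:
    """Return the weight-normalized average probability and the binary label."""
    total = sum(weights[name] for name in probs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(weights[name] * p for name, p in probs.items()) / total
    return score, int(score >= threshold)


# Hypothetical per-model probabilities for a single abstract.
probs = {"model_a": 0.8, "model_b": 0.4, "model_c": 0.6}
# Hypothetical weights, e.g. proportional to each model's validation score.
weights = {"model_a": 0.5, "model_b": 0.25, "model_c": 0.25}

score, label = weighted_ensemble(probs, weights)
print(f"ensemble score={score:.2f}, label={label}")
```

Dividing by the sum of the weights keeps the score in [0, 1] even when the weights are not normalized beforehand.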

Nine LLMs are evaluated on a dataset of PubMed studies manually labeled by two experts with respect to EQ-5D reporting. The weighted ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b achieved a weighted F1-score of 0.74 and an accuracy of 0.74, exceeding the results obtained by the individual models.
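
For readers unfamiliar with the metric, the weighted F1-score is the per-class F1 averaged with weights proportional to each class's share of the true labels. The sketch below computes it alongside accuracy in plain Python. The labels are made-up toy data, not the study's dataset, and the function names are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Average per-class F1, weighting each class by its support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels: 1 = abstract reports EQ-5D results, 0 = it does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

print(f"weighted F1={weighted_f1(y_true, y_pred):.2f}")
print(f"accuracy={accuracy(y_true, y_pred):.2f}")
```

Because the weights follow class frequency, the weighted F1 can sit close to accuracy when classes are imbalanced, so the two figures are best read together.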

Ensembling the top-performing models improved the balance between precision and recall compared with individual models, while the soft stacking approach provided greater reliability and interpretability. Feature analysis shows that the probabilities output by the models play an important role in guiding the final predictions.
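
The idea behind soft stacking can be sketched as follows: a meta-classifier is trained on the probabilities produced by the individual models rather than on their hard labels. The abstract does not state which meta-classifier was used, so this sketch assumes a small logistic regression fitted by gradient descent. The training data, learning rate, and epoch count are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_classifier(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.5,      # assumed learning rate
    epochs: int = 2000,   # assumed number of full-batch updates
) -> Tuple[List[float], float]:
    """Fit logistic regression on base-model probabilities (soft stacking)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Meta-classifier probability that the abstract reports EQ-5D results."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical positive-class probabilities from three base models per abstract.
X = [[0.9, 0.8, 0.7], [0.8, 0.6, 0.9], [0.7, 0.9, 0.6],
     [0.2, 0.3, 0.4], [0.1, 0.4, 0.2], [0.3, 0.2, 0.1]]
y = [1, 1, 1, 0, 0, 0]

w, b = fit_meta_classifier(X, y)
print("learned weights:", [round(v, 2) for v in w])
print("P(EQ-5D) for a new abstract:", round(predict_proba(w, b, [0.9, 0.9, 0.9]), 2))
```

In a setup like this, the learned weights give a rough indication of how much each model's probability contributes to the final prediction, which is one simple form of feature analysis.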

The findings suggest that an ensemble-based LLM setup is a reliable and scalable approach for automating screening in biomedical research.
