BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking


Abstract: Cross-lingual biomedical entity linking (BEL) maps mentions in any language to unique identifiers in a biomedical knowledge base (KB), supporting clinical and biomedical NLP applications. However, expert-annotated training data for BEL are costly to produce, especially for low-resource languages. Moreover, many cross-lingual BEL systems rely on SapBERT-based retrievers trained on the KB's predominantly English aliases, so they generalize poorly to unseen non-English mentions and offer limited context-aware disambiguation.

We propose BioELX, a two-stage cross-lingual BEL framework that requires no task-specific annotated training corpora. In Stage 1, we enrich SapBERT training with Wikidata-derived multilingual aliases and use the resulting retriever to improve cross-lingual candidate retrieval. In Stage 2, we perform context-aware disambiguation with a pre-trained LLM ranker that jointly considers the mention context and each candidate, eliminating the need for supervised training.
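The retrieve-then-rank flow described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not the BioELX implementation: the character n-gram `embed` function stands in for the SapBERT-style retriever, the `overlap_score` stub stands in for the pre-trained LLM ranker, and the toy `KB` with its `KB:0001`/`KB:0002` identifiers is hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy KB: identifier -> aliases, including non-English aliases
# of the kind Stage 1 adds from Wikidata. IDs are placeholders, not real KB IDs.
KB = {
    "KB:0001": ["myocardial infarction", "heart attack", "Herzinfarkt"],
    "KB:0002": ["cardiac arrest", "Herzstillstand"],
}

def embed(text, n=3):
    """ASSUMPTION: bag of character n-grams as a stand-in for a SapBERT-style encoder."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(mention, kb, top_k=2):
    """Stage 1: score each entity by its best-matching alias and keep the top_k IDs."""
    mention_vec = embed(mention)
    scored = [
        (max(cosine(mention_vec, embed(alias)) for alias in aliases), entity_id)
        for entity_id, aliases in kb.items()
    ]
    scored.sort(reverse=True)
    return [entity_id for _, entity_id in scored[:top_k]]

def overlap_score(mention, context, entity_id):
    """ASSUMPTION: word overlap between context and aliases, standing in for an LLM judgment."""
    context_words = set(context.lower().split())
    alias_words = {w for alias in KB[entity_id] for w in alias.lower().split()}
    return len(context_words & alias_words)

def rank(mention, context, candidates, score_fn):
    """Stage 2: re-order retrieved candidates using the mention and its context."""
    return sorted(candidates, key=lambda c: score_fn(mention, context, c), reverse=True)

candidates = retrieve("Herzinfarkt", KB)
best = rank("Herzinfarkt", "patient suffered a heart attack", candidates, overlap_score)[0]
```

The design point the sketch tries to capture is the division of labor: retrieval depends only on the mention string and the alias inventory, so adding multilingual aliases widens what Stage 1 can find, while the ranker sees the surrounding context and needs no task-specific training.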

Experiments on five benchmarks (XL-BEL, EMEA, Patent, WikiMed-DE, and MedMentions) show that BioELX achieves new state-of-the-art performance. It improves average Recall@1 on XL-BEL by 19.2 points, with especially large gains for low-resource languages (e.g., +21.6 on Turkish, +22.1 on Korean, and +30.8 on Thai), and delivers consistent improvements on EMEA (+6.2), Patent (+5.4), and WikiMed-DE (+12.8). Code and resources will be released upon publication.
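For readers unfamiliar with the metric, Recall@1 is commonly computed as the percentage of mentions whose top-ranked candidate is the gold identifier. The sketch below follows that standard definition; the abstract does not spell out the exact evaluation protocol, so the function name `recall_at_k` and its input layout are assumptions made for illustration.

```python
def recall_at_k(predictions, gold, k=1):
    """Percentage of mentions whose gold ID appears among the top-k ranked candidates.

    predictions: one ranked list of candidate IDs per mention (best first).
    gold: the gold ID for each mention, in the same order.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for ranked, gold_id in zip(predictions, gold) if gold_id in ranked[:k])
    return 100.0 * hits / len(gold)

# Two mentions: the first is linked correctly at rank 1, the second only at rank 2.
ranked_lists = [["a", "b"], ["c", "d"]]
gold_ids = ["a", "d"]
r1 = recall_at_k(ranked_lists, gold_ids, k=1)
r2 = recall_at_k(ranked_lists, gold_ids, k=2)
```

Under this definition, a reported gain such as +19.2 is a difference between two such percentages, averaged over the languages in the benchmark.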