BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

Abstract: Cross-lingual biomedical entity linking (BEL) maps mentions in any language to unique identifiers in a biomedical knowledge base (KB), supporting clinical and biomedical NLP applications. However, expert-annotated training data for BEL is costly to produce, especially for low-resource languages. Moreover, many cross-lingual BEL systems rely on SapBERT-based retrievers trained on the predominantly English aliases in the KB, which leads to poor generalization to unseen non-English mentions and limited context-aware disambiguation.

We propose BioELX, a two-stage cross-lingual BEL framework that requires no task-specific annotated training corpora. In Stage 1, we enrich SapBERT training with multilingual aliases derived from Wikidata and use the resulting retriever to improve cross-lingual candidate retrieval. In Stage 2, we perform context-aware disambiguation with a pre-trained LLM ranker that jointly considers the mention context and each candidate, eliminating the need for supervised training.
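
The retrieve-then-rank flow described above can be illustrated with a minimal Python sketch. It is not the BioELX implementation: the character-trigram `embed` function stands in for the alias-trained retriever, `overlap_scorer` stands in for the LLM ranker, and the knowledge base, its identifiers (`KB:001` and so on), and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy KB: identifier -> multilingual aliases (illustrative only).
KB = {
    "KB:001": ["diabetes mellitus", "diabète sucré", "şeker hastalığı"],
    "KB:002": ["hypertension", "high blood pressure", "Bluthochdruck", "hipertansiyon"],
    "KB:003": ["asthma", "Asthma bronchiale", "astım"],
}


def embed(text, n=3):
    """Stand-in encoder: bag of character trigrams (assumption, not SapBERT)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(mention, kb, k=2):
    """Stage 1: score each entity by its best-matching alias, keep the top k."""
    query = embed(mention)
    scored = [
        (max(cosine(query, embed(alias)) for alias in aliases), entity_id)
        for entity_id, aliases in kb.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entity_id for _, entity_id in scored[:k]]


def overlap_scorer(mention, context, aliases):
    """Placeholder for the LLM ranker: counts alias tokens found in the context."""
    context_tokens = set(context.lower().split())
    return sum(
        token in context_tokens
        for alias in aliases
        for token in alias.lower().split()
    )


def rerank(mention, context, candidates, kb, score_fn):
    """Stage 2: reorder the retrieved candidates with a context-aware scorer."""
    return sorted(
        candidates,
        key=lambda entity_id: score_fn(mention, context, kb[entity_id]),
        reverse=True,
    )


if __name__ == "__main__":
    candidates = retrieve("hipertansiyon", KB, k=2)
    context = "patient has high blood pressure and takes medication"
    print(rerank("hipertansiyon", context, candidates, KB, overlap_scorer))
```

Because `rerank` takes the scorer as an argument, the placeholder could be swapped for a function that prompts an LLM with the mention context and candidate aliases without changing the retrieval stage.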

Experiments on five benchmarks (XL-BEL, EMEA, Patent, WikiMed-DE, and MedMentions) show that BioELX achieves new state-of-the-art performance. It improves average Recall@1 on XL-BEL by +19.2 points, with especially large gains for low-resource languages (e.g., +21.6 on Turkish, +22.1 on Korean, and +30.8 on Thai), and delivers consistent improvements on EMEA (+6.2), Patent (+5.4), and WikiMed-DE (+12.8). Code and resources will be released upon publication.
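
For readers unfamiliar with the metric, the following Python sketch shows how Recall@1 and a per-language average are commonly computed. The function names and toy predictions are hypothetical, and the unweighted mean over languages is an assumption; the abstract does not state how the XL-BEL average is formed.

```python
def recall_at_k(predictions, gold, k=1):
    """Percentage of mentions whose gold identifier is among the top-k predictions."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(g in ranked[:k] for ranked, g in zip(predictions, gold))
    return 100.0 * hits / len(gold)


def macro_average(scores_by_language):
    """Unweighted mean over languages (assumed averaging scheme)."""
    return sum(scores_by_language.values()) / len(scores_by_language)


if __name__ == "__main__":
    # Toy ranked predictions for three mentions (illustrative identifiers only).
    ranked = [["a", "b"], ["c", "a"], ["b"]]
    gold = ["a", "a", "b"]
    print(recall_at_k(ranked, gold, k=1))
    print(recall_at_k(ranked, gold, k=2))
```

A gain such as +19.2 is then the difference between two such percentages, expressed in points rather than as a relative change.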