Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes
Abstract: Linking free-text phenotype descriptions to ontology terms, typically referred to as phenotype annotation, is essential for the cross-study integration of comparative morphological data. This labor-intensive process has relied heavily on highly trained human experts, which makes it difficult to scale and has turned it into a key bottleneck.
Dahdul et al. (2018) established a Gold Standard (GS) of Entity-Quality (EQ) annotations across seven phylogenetic studies and used it to evaluate three human curators and the Semantic CharaParser NLP tool with ontology-based semantic similarity metrics; they reported that machine-human consistency was significantly lower than inter-curator (human-human) consistency.
Here we revisit that benchmark with five frontier hosted LLMs from Anthropic and OpenAI, each operating as an “agentic curator” within a self-contained workspace that supplies the source publication PDF, the same annotation guide used by the original human curators, the four project ontologies (UBERON, PATO, BSPO, GO), and a validation script.
Evaluated against the same Gold Standard, every agent fell within the range of inter-curator variability observed among the three trained human biocurators in the original study; the best-performing agents approached but did not reach the best-performing human curator. The agents substantially outperformed Semantic CharaParser on all four metrics.