EDEN: A Large-Scale Corpus of Clinical Notes for Italian

We present EDEN (Emergency Department Electronic Notes), a new large-scale corpus of clinical notes produced in the Emergency Departments of Italian hospitals. In its current version, the corpus comprises approximately 4 million fully anonymized clinical notes covering the different phases of patient care during an emergency department stay.

In addition, a subset of about six thousand notes has been manually annotated by clinical experts using a structured Case Report Form (CRF) containing 132 items relevant to two patient presentations in emergency departments: dyspnea and loss of consciousness. Items may take numerical values (e.g., blood oxygen saturation), categorical values (e.g., level of consciousness), binary values (e.g., presence of trauma), or values of mixed type.
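
To make the four item value types concrete, the following is a minimal sketch of how a CRF item could be represented and a candidate value checked against its type. It is not the EDEN schema: the class names, the item names, the category labels, and the convention that `None` marks an item not mentioned in a note are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple, Union

Value = Union[float, str, bool]


class ItemType(Enum):
    NUMERICAL = "numerical"
    CATEGORICAL = "categorical"
    BINARY = "binary"
    MIXED = "mixed"


@dataclass(frozen=True)
class CRFItem:
    """One CRF item (hypothetical representation, not the EDEN schema)."""
    name: str
    item_type: ItemType
    categories: Tuple[str, ...] = ()  # allowed labels for categorical/mixed items

    def accepts(self, value: Optional[Value]) -> bool:
        """Return True if `value` is compatible with this item's type."""
        if value is None:  # assumption: None means "not mentioned in the note"
            return True
        # bool is a subclass of int in Python, so exclude it from numbers.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        is_category = isinstance(value, str) and value in self.categories
        if self.item_type is ItemType.NUMERICAL:
            return is_number
        if self.item_type is ItemType.CATEGORICAL:
            return is_category
        if self.item_type is ItemType.BINARY:
            return isinstance(value, bool)
        return is_number or is_category  # MIXED: either a number or a label


# Hypothetical items mirroring the examples given in the text.
spo2 = CRFItem("oxygen_saturation", ItemType.NUMERICAL)
consciousness = CRFItem(
    "consciousness_level", ItemType.CATEGORICAL, ("alert", "drowsy", "unresponsive")
)
trauma = CRFItem("trauma_present", ItemType.BINARY)
mixed_item = CRFItem("example_mixed_item", ItemType.MIXED, ("not measurable",))
```

Under these assumptions, `spo2.accepts(94)` holds while `spo2.accepts("alert")` does not, and a mixed item accepts either a number or one of its listed labels.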

The annotation process involved multiple clinicians and underwent iterative revision to resolve ambiguities in item formulation, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill a significant gap in the data available to support both the development and the use of Large Language Models in concrete medical applications.

We describe the data collection protocol, the on-site anonymization pipeline, corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark and provide zero-shot baselines obtained with Gemma-27B and MedGemma-27B. To the best of our knowledge, EDEN is the largest freely available corpus of clinical notes for the Italian language.
