MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports
Abstract: Semi-structured information extraction (IE) from OCR-derived clinical reports is crucial for efficiently reconstructing patients’ longitudinal medical histories. In practice, this scenario commonly involves three tasks: (i) field-header (key) discovery, (ii) key-conditioned question answering (QA), and (iii) end-to-end key-value pair extraction.
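To make the task setting concrete, the sketch below (not from the benchmark; all names and strings are invented for illustration) shows why OCR noise complicates key discovery: a corrupted field header such as "Patjent Narne" must still be mapped to a canonical key. A toy normalizer using fuzzy string matching, assuming a small set of canonical keys:

```python
import difflib

def match_key(noisy_header: str, canonical_keys: list[str], cutoff: float = 0.7):
    """Toy normalizer: map an OCR-corrupted header to its closest canonical key.

    Returns None when no canonical key is similar enough, which models the
    'unknown key' case the benchmark targets. The cutoff is an illustrative
    assumption, not a value used by MedStruct-S.
    """
    hits = difflib.get_close_matches(noisy_header, canonical_keys, n=1, cutoff=cutoff)
    return hits[0] if hits else None

canonical = ["Patient Name", "Blood Pressure", "Diagnosis"]
match_key("Patjent Narne", canonical)  # OCR-corrupted header resolves to "Patient Name"
match_key("Zzz", canonical)            # no sufficiently close key: None
```

Real systems face harder cases than character-level typos (abbreviations, reordered words, merged headers), which is precisely the heterogeneity the benchmark is designed to expose.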
However, existing evaluations often under-model two factors: heterogeneous and incompletely known key representations, and OCR-induced noise. This makes it difficult to assess model robustness in real-world settings. We present MedStruct-S, a benchmark specifically designed to evaluate these tasks under unknown keys and OCR noise.
MedStruct-S contains 3,582 annotated real-world clinical report pages. Using MedStruct-S, we benchmark two representative paradigms: encoder-only sequence labeling with post-processing and decoder-only structured generation, covering four encoder-only and five decoder-only models spanning 0.11B to 103B parameters.
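To illustrate the encoder-only paradigm, the following sketch shows the post-processing step that turns token-level sequence labels into key-value pairs. The BIO tag scheme (KEY/VAL spans) and the adjacency-based pairing heuristic are illustrative assumptions, not the benchmark's actual pipeline:

```python
# Hypothetical post-processing for encoder-only sequence labeling:
# collapse BIO-tagged tokens into spans, then pair each KEY span with
# the VAL span that follows it. A KEY with no following VAL maps to
# None, modeling the null-value case.

def bio_to_pairs(tokens, tags):
    # Step 1: collapse BIO tags into (label, text) spans.
    spans, cur_label, cur_tokens = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_label:
                spans.append((cur_label, " ".join(cur_tokens)))
            cur_label, cur_tokens = tag[2:], [tok]
        elif tag.startswith("I-") and cur_label == tag[2:]:
            cur_tokens.append(tok)
        else:  # "O" or an inconsistent I- tag closes the current span
            if cur_label:
                spans.append((cur_label, " ".join(cur_tokens)))
            cur_label, cur_tokens = None, []
    if cur_label:
        spans.append((cur_label, " ".join(cur_tokens)))

    # Step 2: pair each KEY with the next VAL span.
    pairs, pending_key = {}, None
    for label, text in spans:
        if label == "KEY":
            if pending_key is not None:
                pairs[pending_key] = None
            pending_key = text
        elif label == "VAL" and pending_key is not None:
            pairs[pending_key] = text
            pending_key = None
    if pending_key is not None:
        pairs[pending_key] = None
    return pairs

tokens = ["Blood", "Pressure", ":", "120/80", "mmHg", "Allergies", ":"]
tags   = ["B-KEY", "I-KEY", "O", "B-VAL", "I-VAL", "B-KEY", "O"]
bio_to_pairs(tokens, tags)
# -> {"Blood Pressure": "120/80 mmHg", "Allergies": None}
```

The decoder-only paradigm instead generates the structured output (e.g. JSON key-value pairs) directly, trading this rule-based pairing step for the model's own formatting.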
Our results show that encoder-only models achieve the best performance on non-null-value key-conditioned QA despite being substantially smaller than decoder-only models. At comparable parameter scales, encoder-only models still perform better overall; without controlling for scale, fine-tuned decoder-only models deliver the strongest overall results. These findings demonstrate that MedStruct-S provides a reliable and practical basis for selecting and comparing models across different semi-structured IE settings.