MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports
Abstract: Semi-structured information extraction (IE) from OCR-derived clinical reports is crucial for efficiently reconstructing patients’ longitudinal medical histories. In practice, this scenario commonly involves three tasks: (i) field-header (key) discovery, (ii) key-conditioned question answering (QA), and (iii) end-to-end key-value pair extraction.
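To make the task setting concrete, the sketch below (not from the benchmark; all names and strings are invented for illustration) shows why OCR noise complicates key discovery: a corrupted field header such as "Patjent Narne" must still be mapped to a canonical key. A toy normalizer using fuzzy string matching, assuming a small set of canonical keys:

```python
import difflib

def match_key(noisy_header: str, canonical_keys: list[str], cutoff: float = 0.7):
    """Toy normalizer: map an OCR-corrupted header to its closest canonical key.

    Returns None when no canonical key is similar enough, which models the
    'unknown key' case the benchmark targets. The cutoff is an illustrative
    assumption, not a value used by MedStruct-S.
    """
    hits = difflib.get_close_matches(noisy_header, canonical_keys, n=1, cutoff=cutoff)
    return hits[0] if hits else None

canonical = ["Patient Name", "Blood Pressure", "Diagnosis"]
match_key("Patjent Narne", canonical)  # OCR-corrupted header resolves to "Patient Name"
match_key("Zzz", canonical)            # no sufficiently close key: None
```

Real systems face harder cases than character-level typos (abbreviations, reordered words, merged headers), which is precisely the heterogeneity the benchmark is designed to expose.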
However, existing evaluations often under-model two factors: heterogeneous and incompletely known key representations, and OCR-induced noise. This makes it difficult to assess model robustness in real-world settings. We present MedStruct-S, a benchmark specifically designed to evaluate these tasks under unknown keys and OCR noise.
MedStruct-S contains 3,582 annotated real-world clinical report pages. Using MedStruct-S, we benchmark two representative paradigms: encoder-only sequence labeling with post-processing and decoder-only structured generation, covering four encoder-only and five decoder-only models spanning 0.11B to 103B parameters.
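To illustrate the encoder-only paradigm, the following sketch shows the post-processing step that turns token-level sequence labels into key-value pairs. The BIO tag scheme (KEY/VAL spans) and the adjacency-based pairing heuristic are illustrative assumptions, not the benchmark's actual pipeline:

```python
# Hypothetical post-processing for encoder-only sequence labeling:
# collapse BIO-tagged tokens into spans, then pair each KEY span with
# the VAL span that follows it. A KEY with no following VAL maps to
# None, modeling the null-value case.

def bio_to_pairs(tokens, tags):
    # Step 1: collapse BIO tags into (label, text) spans.
    spans, cur_label, cur_tokens = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur_label:
                spans.append((cur_label, " ".join(cur_tokens)))
            cur_label, cur_tokens = tag[2:], [tok]
        elif tag.startswith("I-") and cur_label == tag[2:]:
            cur_tokens.append(tok)
        else:  # "O" or an inconsistent I- tag closes the current span
            if cur_label:
                spans.append((cur_label, " ".join(cur_tokens)))
            cur_label, cur_tokens = None, []
    if cur_label:
        spans.append((cur_label, " ".join(cur_tokens)))

    # Step 2: pair each KEY with the next VAL span.
    pairs, pending_key = {}, None
    for label, text in spans:
        if label == "KEY":
            if pending_key is not None:
                pairs[pending_key] = None
            pending_key = text
        elif label == "VAL" and pending_key is not None:
            pairs[pending_key] = text
            pending_key = None
    if pending_key is not None:
        pairs[pending_key] = None
    return pairs

tokens = ["Blood", "Pressure", ":", "120/80", "mmHg", "Allergies", ":"]
tags   = ["B-KEY", "I-KEY", "O", "B-VAL", "I-VAL", "B-KEY", "O"]
bio_to_pairs(tokens, tags)
# -> {"Blood Pressure": "120/80 mmHg", "Allergies": None}
```

The decoder-only paradigm instead generates the structured output (e.g. JSON key-value pairs) directly, trading this rule-based pairing step for the model's own formatting.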
Our results show that encoder-only models achieve the best performance on non-null-value key-conditioned QA despite being substantially smaller than decoder-only models. At comparable parameter scales, encoder-only models still perform better overall; without controlling for scale, fine-tuned decoder-only models deliver the strongest overall results. These findings demonstrate that MedStruct-S provides a reliable and practical basis for selecting and comparing models across different semi-structured IE settings.