Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

Abstract: Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending them to the medical domain, particularly to volumetric (3D) images, is challenging because of high computational complexity, volumetric dependencies, and the semantic gap between visual features and clinical terminology.

Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, where linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that minimizes the need for extensive parameter training.

The RAD3D-Prefix module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. Because the LLM stays frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets.
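As a rough illustration of the idea of conditioning a frozen LLM on image embeddings plus diagnostic logits, the following minimal sketch concatenates the two inputs and projects them into a short sequence of prefix tokens. It is not the authors' implementation: the class name `PrefixProjector`, the single linear projection, the sigmoid applied to the logits, and all dimensions are assumptions chosen for illustration, and plain Python lists stand in for tensors so that the example is self-contained.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Map a raw classification logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


class PrefixProjector:
    """Hypothetical trainable projection from (image embedding, diagnostic
    logits) to prefix tokens. In this sketch it is the only trainable part;
    the LLM that would consume the prefix is assumed frozen and is not modeled.
    """

    def __init__(self, image_dim, num_labels, llm_dim, prefix_len, seed=0):
        rng = random.Random(seed)
        self.in_dim = image_dim + num_labels
        self.llm_dim = llm_dim
        self.prefix_len = prefix_len
        out_dim = prefix_len * llm_dim
        scale = 1.0 / math.sqrt(self.in_dim)
        # One linear layer: weight has shape (out_dim, in_dim), bias (out_dim,).
        self.weight = [
            [rng.uniform(-scale, scale) for _ in range(self.in_dim)]
            for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def num_trainable(self) -> int:
        """Count the parameters of the projection layer."""
        return len(self.weight) * self.in_dim + len(self.bias)

    def __call__(self, image_embedding, diagnostic_logits):
        # Assumption: logits are squashed to probabilities before fusion.
        probs = [sigmoid(z) for z in diagnostic_logits]
        features = list(image_embedding) + probs
        if len(features) != self.in_dim:
            raise ValueError("unexpected input dimensionality")
        flat = [
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(self.weight, self.bias)
        ]
        # Reshape the flat output into prefix_len tokens of size llm_dim.
        return [
            flat[i * self.llm_dim:(i + 1) * self.llm_dim]
            for i in range(self.prefix_len)
        ]


projector = PrefixProjector(image_dim=8, num_labels=3, llm_dim=4, prefix_len=2)
prefix_tokens = projector([0.1] * 8, [2.0, -1.0, 0.5])
print(len(prefix_tokens), len(prefix_tokens[0]), projector.num_trainable())
```

In a setup like this, the prefix tokens would be prepended to the text token embeddings of the frozen LLM, so gradient updates touch only the projection weights. A practical version would use a tensor library and a learned image encoder rather than hand-written lists.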

Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger LLMs (~1B+ parameters) and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency.

Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives.
