When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG

Medical question answering is a high-stakes setting where factual errors can have serious consequences. Retrieval-augmented generation (RAG) is widely viewed as a promising solution, and prior work has reported substantial gains for large medical QA models.

We revisit this assumption across a broad range of open-weight instruction-tuned models spanning 7B to 72B parameters. Across five models, ten biomedical QA datasets, four retrieval methods, and four retrieval corpora, we find that retrieval yields only small and inconsistent improvements over a no-retrieval baseline, typically within 1-2 points.

In contrast, the choice of backbone model has a much larger effect than the choice of retriever or corpus, and expert and layman retrieval sources perform similarly in most settings. These results suggest that the main bottleneck is not retrieval quality alone, but the model's limited ability to use retrieved evidence effectively.