Readable but Not Controllable: Neuron-Level Evidence for Medical LLM Hallucination

Abstract: Hallucination remains one of the central obstacles to deploying medical LLMs. Yet even when hallucination can be detected, it remains unclear whether the internal representations associated with it support control, not just detection.

Using four open-source models across a suite of medical question-answering datasets, we show that a simple, carefully conditioned probe can reliably detect hallucination, with AUROC scores between 0.77 and 0.86 in our experiments.
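
As a rough illustration of what probe-based detection involves, the sketch below fits a linear probe on activations and scores it with AUROC. It is a minimal sketch under stated assumptions, not the probe used in this work: the activations are synthetic, plain logistic regression stands in for the "carefully conditioned" probe (whose conditioning is not specified here), and the names `fit_logistic_probe`, `make_synthetic_activations`, and `run_probe_demo` are hypothetical.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) identity; assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_logistic_probe(X, y, lr=0.1, steps=500, l2=1e-2):
    """Fit a linear probe by gradient descent on the L2-regularized logistic loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(hallucination)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def make_synthetic_activations(n=3000, d=256, shift=0.05, seed=0):
    """Assumption: synthetic stand-in for hidden states; label 1 = hallucinated answer."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    delta = shift * rng.normal(size=d)            # small mean shift on every neuron
    X = rng.normal(size=(n, d)) + np.outer(2 * y - 1, delta)
    return X, y

def run_probe_demo(n_train=2000):
    """Train on the first n_train examples, report AUROC on the held-out rest."""
    X, y = make_synthetic_activations()
    w, b = fit_logistic_probe(X[:n_train], y[:n_train])
    return auroc(X[n_train:] @ w + b, y[n_train:])

if __name__ == "__main__":
    print(f"held-out AUROC: {run_probe_demo():.3f}")
```

With real models, `X` would hold hidden-state vectors extracted at a chosen layer and token position, and `y` the hallucination labels; both choices are outside what this sketch covers.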

We further show that this signal is distributed and redundant rather than narrowly localized. Systematically selected neurons outperform random neurons only at very small subset sizes, whereas random subsets of a few hundred neurons recover nearly the full signal, and low-dimensional random projections preserve most of the detection performance.
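
The comparison described above (selected versus random neuron subsets, plus a random projection) can be sketched as follows. This is an illustration of the protocol only, under loud assumptions: the data are synthetic and deliberately built so that the label signal is spread redundantly over many neurons, so the output demonstrates the procedure rather than reproducing the finding. Ranking neurons by class-mean gap, the ridge-regression probe, and all function names are hypothetical choices, not the method used in this work.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the rank-sum identity; assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def ridge_probe_auroc(X_tr, y_tr, X_te, y_te, lam=1.0):
    """Fit a ridge-regression linear probe on centered features; return test AUROC."""
    mean = X_tr.mean(axis=0)
    A = X_tr - mean
    w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ (2.0 * y_tr - 1.0))
    return auroc((X_te - mean) @ w, y_te)

def make_redundant_activations(n=6000, d=512, r=8, mu=0.75, sigma=2.0, seed=0):
    """Assumption: every neuron is a noisy mix of r latents; latent 0 carries the label."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    H = rng.normal(size=(n, r))
    H[:, 0] += mu * (2 * y - 1)
    X = H @ rng.normal(size=(r, d)) + sigma * rng.normal(size=(n, d))
    return X, y

def compare_subsets(ks=(4, 32, 256), m=32, n_train=4000, seed=1):
    """Held-out AUROC for top-k neurons, random-k neurons, and an m-dim projection."""
    rng = np.random.default_rng(seed)
    X, y = make_redundant_activations()
    X_tr, y_tr, X_te, y_te = X[:n_train], y[:n_train], X[n_train:], y[n_train:]
    # Rank neurons by absolute class-mean gap, using training data only.
    gap = np.abs(X_tr[y_tr == 1].mean(axis=0) - X_tr[y_tr == 0].mean(axis=0))
    ranked = np.argsort(-gap)
    results = {"full": ridge_probe_auroc(X_tr, y_tr, X_te, y_te)}
    for k in ks:
        top = ranked[:k]
        rand = rng.choice(X.shape[1], size=k, replace=False)
        results[f"top_{k}"] = ridge_probe_auroc(X_tr[:, top], y_tr, X_te[:, top], y_te)
        results[f"random_{k}"] = ridge_probe_auroc(X_tr[:, rand], y_tr, X_te[:, rand], y_te)
    # Gaussian random projection from d neurons down to m dimensions.
    R = rng.normal(size=(X.shape[1], m)) / np.sqrt(m)
    results[f"projection_{m}"] = ridge_probe_auroc(X_tr @ R, y_tr, X_te @ R, y_te)
    return results

if __name__ == "__main__":
    for name, value in compare_subsets().items():
        print(f"{name:>14}: {value:.3f}")
```

In practice the random-subset and projection rows would be averaged over many seeds, since a single random draw at small `k` is noisy.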

Beyond detection, we test whether this representation is causally actionable. Across 16 model-dataset combinations, our results reveal a sharp gap between decodability and controllability. The same internal structure that makes hallucination easy to detect does not translate into reliable neuron-level control.

These findings indicate that medical hallucination is readily visible in internal activations but not easily corrected by steering the neurons most associated with it. More broadly, our results suggest that hallucination mitigation is not simply a matter of identifying the right neurons, and point to a deeper separation between what representations reveal and what they allow us to change.
