Geometric Deviation as an Unsupervised Pre-Generation Reliability Signal: Probing LLM Representations for Answerability


Abstract: A reliable language model should be able to signal, before generation, when a query falls outside its knowledge. We investigate whether representation geometry can provide such a pre-generation signal by measuring how far hidden states deviate from an answerable reference set. The approach requires no labeled failure data and no access to model outputs.

Across three instruction-tuned models (Llama 3.1-8B, Qwen 2.5-7B, and Mistral-7B-Instruct) and three prompt forms (Math, Fact, Code), we find that geometry primarily encodes task form. Within mathematical prompts, unanswerable inputs consistently deviate from the answerable centroid, yielding strong separation (ROC-AUC 0.78-0.84). This single-pass pre-generation signal outperforms a simple refusal baseline and compares favorably to self-consistency. It also captures cases in which the models do not explicitly refuse.

In contrast, no reliable geometric signal emerges for factual prompts, indicating that the effect is form-conditional rather than universal. Code prompts show large effect sizes but higher variance, suggesting partial generalization beyond mathematical form. A layer-wise analysis reveals that the signal arises in early layers and gradually attenuates toward the output.

These results suggest that answerability-related geometry is established before the final stages of generation. Taken together, the findings indicate that geometric deviation can serve as a lightweight pre-generation signal that is reliable in structured domains with formal answerability constraints, and they mark clear boundaries on where it generalizes.
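
As a rough illustration of the centroid-deviation idea described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes Euclidean distance as the deviation measure (the abstract does not name a metric), and it uses synthetic Gaussian vectors in place of real hidden states. The names `deviation_scores` and `roc_auc` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def deviation_scores(reference, queries):
    """Distance of each query vector from the centroid of the reference set.

    Assumption: Euclidean distance; the source does not specify the metric.
    """
    centroid = reference.mean(axis=0)
    return np.linalg.norm(queries - centroid, axis=1)


def roc_auc(scores, labels):
    """ROC-AUC computed from the Mann-Whitney U statistic.

    labels: 1 (or True) for unanswerable, 0 (or False) for answerable.
    Tied scores receive their average rank.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u_stat = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


# Synthetic stand-ins for hidden states (illustrative data, not from the paper).
rng = np.random.default_rng(0)
dim = 16
reference = rng.normal(size=(200, dim))               # answerable reference set
answerable = rng.normal(size=(100, dim))              # held-out answerable queries
unanswerable = rng.normal(loc=1.0, size=(100, dim))   # shifted to mimic deviation

queries = np.vstack([answerable, unanswerable])
labels = np.concatenate([np.zeros(100), np.ones(100)])
scores = deviation_scores(reference, queries)
auc = roc_auc(scores, labels)
```

In this sketch the score depends only on the reference set and the query representation, which mirrors the abstract's claim that no labeled failure data and no model outputs are needed. Labels enter only at evaluation time, when the ROC-AUC is computed.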