Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

Abstract: The rapid rise in popularity of large language models (LLMs) for translation calls for a thorough study of the reliability of their confidence in their own outputs.

Unlike in many generation tasks, translation errors and confidence levels can be useful at different levels of granularity (tokens, words, or spans).

Unsupervised approaches based on internal signals like predicted probabilities can be misleading because they reflect certainty among alternatives rather than correctness.

In addition, they require access to such internal signals.
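To make "certainty among alternatives" concrete, the following minimal sketch shows how a predicted probability is typically obtained from a model's raw scores via a softmax. The function names and the toy logits are illustrative assumptions, not taken from the paper. The point is that the resulting number only says how strongly the chosen token outranks its competitors; it carries no information about whether that token is a correct translation.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over candidate tokens."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]


def chosen_token_probability(logits, chosen_index):
    """Probability mass the model assigns to the token it actually emitted.

    This is an internal signal: computing it requires access to the logits.
    """
    return softmax(logits)[chosen_index]


# Hypothetical logits for three candidate tokens at one decoding step.
# The first candidate dominates, so its probability is high regardless of
# whether it is the right translation.
toy_logits = [5.0, 1.0, 0.0]
internal_confidence = chosen_token_probability(toy_logits, 0)
```

Because the function takes logits as input, it also illustrates the access requirement: with an API that returns only text, this quantity cannot be computed.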

Here, we devise five verbalized methods of extracting an LLM’s per-token confidence without those shortcomings and compare their reliability with that of the model’s internal signals of certainty.

We evaluate reliability using two forms of alignment: fine-grained error detection and calibration.
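The abstract does not state which calibration metric is used. As one common choice, the sketch below computes expected calibration error (ECE) over per-token confidences; the function name, the equal-width binning, and the default of ten bins are assumptions made for illustration. A well-calibrated model has confidence close to its empirical accuracy in every bin, giving an ECE near zero.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: per-token confidence values in [0, 1]
    correct:     per-token booleans, True if the token is not an error
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each token to a bin by its confidence; 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    # Sum the |mean confidence - accuracy| gaps, weighted by bin size.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_confidence - accuracy)
    return ece
```

The same per-token confidences could serve error detection by flagging tokens whose confidence falls below a threshold, though the abstract does not specify how detection is scored.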

For both, internal and verbalized methods perform similarly, although results vary by model.

Interestingly, we find little to no correlation between internal and verbalized methods.
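The abstract does not name the correlation measure. As an illustration only, the sketch below computes a Pearson correlation between two lists of per-token confidences, one from an internal signal and one verbalized. The function name and the toy values are hypothetical and are not results from the paper.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        # Correlation is undefined for a constant sequence; this sketch
        # returns 0.0 as a convention rather than raising.
        return 0.0
    return covariance / math.sqrt(variance_x * variance_y)


# Toy per-token confidences, invented for illustration.
internal_scores = [0.9, 0.1, 0.9, 0.1]
verbalized_scores = [0.5, 0.5, 0.9, 0.9]
toy_correlation = pearson_correlation(internal_scores, verbalized_scores)
```

In this toy case the two signals are uncorrelated even though each one varies across tokens, which is the kind of pattern the sentence above describes.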