Harnessing the Latent Space: From Steering Vectors to Model Calibrators for Control and Trust

Abstract: Language models have evolved from unreliable text generators into highly capable large models with trillions of parameters. These capability gains have come hand in hand with increases in scale, which makes the models' internal representations harder to understand.

Because millions of users increasingly rely on language models to interact with external tools or to make decisions in medium- or high-stakes scenarios, we need to establish control over model behavior and to know when model outputs can be trusted.

In this paper, we discuss our contributions to harnessing latent spaces: we propose steering vectors for control and develop latent-space-based model calibrators for trust. Together, these contributions help demystify the latent spaces of language models and offer new insights into how model internals can be used to build more trustworthy language technology.
