Do text embeddings perfectly encode text?

The rise of the vector database

As a result of the rapid advancement of generative AI in recent years, many companies are rushing to integrate AI into their businesses. One of the most common ways of doing this is to build AI systems that answer questions about information found in a database of documents. Most solutions to this problem are built on one key technique: Retrieval-Augmented Generation (RAG).

This is what lots of people do now as a cheap and easy way to get started using AI: store lots of documents in a database, have the AI retrieve the most relevant documents for a given input, and then generate a response to the input that is informed by the retrieved documents.

These RAG systems determine document relevancy by using “embeddings”, vector representations of documents produced by an embedding model. These embeddings are supposed to represent some notion of similarity, so documents that are relevant for search will have high vector similarity in embedding space.
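A minimal sketch of this relevance scoring, with made-up four-dimensional vectors standing in for real model outputs (a real embedding model would produce hundreds or thousands of dimensions, and the document names here are hypothetical):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings"; a real model would produce far more dims.
docs = {
    "refund policy":  np.array([0.9, 0.1, 0.0, 0.2]),
    "shipping times": np.array([0.1, 0.8, 0.3, 0.0]),
    "privacy notice": np.array([0.0, 0.2, 0.9, 0.1]),
}
# Pretend this is the embedding of "how do I get my money back?"
query = np.array([0.8, 0.2, 0.1, 0.1])

# Retrieve the document whose embedding is most similar to the query.
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # the document closest in embedding space
```

A vector database does essentially this, but with approximate nearest-neighbor indexes so the search scales to millions of embeddings.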

The prevalence of RAG has led to the rise of the vector database, a new type of database designed for storing and searching through large numbers of embeddings. Hundreds of millions of dollars of funding have been given out to startups that claim to facilitate RAG by making embedding search easy. And the effectiveness of RAG is the reason why lots of new applications are converting text to vectors and storing them in these vector databases.

Embeddings are hard to read

So what is stored in a text embedding? Beyond the requirement of semantic similarity, there are no constraints on which embedding must be assigned to a given text input. The numbers within an embedding vector can be anything, and vary based on the model’s initialization. We can interpret the similarities of an embedding with other embeddings, but have no hope of ever understanding the individual numbers within it.

Now imagine you’re a software engineer building a RAG system for your company. You decide to store your vectors in a vector database. You notice that in a vector database, what’s stored are embedding vectors, not the text data itself. The database fills up with rows and rows of random-seeming numbers that represent text data, but the database never ‘sees’ any text at all. You know that the text corresponds to customer documents protected by your company’s privacy policy. But you’re never really sending text off-premises; you only ever send embedding vectors, which look to you like random numbers.

What if someone hacks into the database and gains access to all your text embedding vectors – would this be bad? Or if the service provider wanted to sell your data to advertisers – could they? Both scenarios involve being able to take embedding vectors and invert them somehow back to text.

From text to embeddings…back to text

The problem of recovering text from embeddings is exactly the scenario we tackle in our paper Text Embeddings Reveal (Almost) As Much As Text (EMNLP 2023). Are embedding vectors a secure format for information storage and communication? Put simply: can input text be recovered from output embeddings?

Before diving into solutions, let’s think about the problem a little more. Text embeddings are the output of neural networks: sequences of matrix multiplications, interleaved with nonlinear operations, applied to input data. In a typical text-processing neural network, an input string is split into a number of token vectors, which repeatedly undergo nonlinear transformations. At the output layer of the model, the token vectors are averaged into a single embedding vector.
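The final averaging (mean-pooling) step can be sketched as follows; the token vectors here are random stand-ins, not outputs of a real encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the encoder produced one hidden vector per input token.
seq_len, hidden_dim = 6, 8                      # e.g. 6 tokens, 8-dim hidden states
token_vectors = rng.normal(size=(seq_len, hidden_dim))

# Mean pooling: collapse the whole sequence into a single embedding vector.
embedding = token_vectors.mean(axis=0)

print(embedding.shape)  # one fixed-size vector, regardless of input length
```

Notice that the pooling itself already discards information: many different sequences of token vectors share the same mean.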

A maxim from information theory known as the data processing inequality tells us that a function cannot add information to its input; it can only preserve or decrease the amount of information available. Even though conventional wisdom tells us that deeper layers of a neural network construct ever-higher-order representations, they aren’t adding any information about the world that didn’t come in on the input side. Moreover, the nonlinear layers certainly destroy some information. One ubiquitous nonlinear layer in modern neural networks is the “ReLU” function, which simply sets all negative inputs to zero. After applying ReLU throughout the many layers of a typical text embedding model, it is not possible to retain all the information from the input.
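The information loss from ReLU is easy to demonstrate: two distinct inputs can collapse to the same output, so no downstream function can tell them apart. A small sketch:

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """ReLU: zero out every negative coordinate."""
    return np.maximum(x, 0.0)

# Two different inputs that ReLU maps to the identical output:
x1 = np.array([-1.0, 2.0, -3.0])
x2 = np.array([-5.0, 2.0, -0.5])

print(relu(x1))  # the negative entries are zeroed
print(relu(x2))  # same output as relu(x1)
# Once the negative coordinates are gone, no function applied to the
# ReLU output can recover whether the input was x1 or x2.
```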

Inversion in other contexts

Similar questions about information content have been asked in the computer vision community. Several results have shown that deep representations (embeddings, essentially) from image models can be used to recover the input images with some degree of fidelity. An early result (Dosovitskiy, 2016) showed that images can be recovered from the feature outputs of deep convolutional neural networks (CNNs). Given the high-level feature representation from a CNN, they could invert it to produce a blurry-but-similar version of the original input image.

People have improved on the image-embedding inversion process since 2016: models have been developed that invert with higher accuracy and have been shown to work across more settings. Surprisingly, some work has shown that images can even be inverted from the outputs of an ImageNet classifier (just 1,000 class probabilities).

The journey to vec2text

If inversion is possible for image representations, then why can’t it work for text? Let’s consider a toy problem of recovering text from embeddings. For our toy setting, we’ll restrict text inputs to 32 tokens.