AGI Is Not Multimodal
“In projecting language back as the model for thought, we lose sight of the tacit embodied understanding that undergirds our intelligence.” –Terry Winograd
The recent successes of generative AI models have convinced some that AGI is imminent. While these models appear to capture the essence of human intelligence, they defy even our most basic intuitions about it. They have emerged not because they are thoughtful solutions to the problem of intelligence, but because they scaled effectively on hardware we already had. Seduced by the fruits of scale, some have come to believe that it provides a clear pathway to AGI.
The most emblematic case of this is the multimodal approach, in which massive modular networks are optimized for an array of modalities that, taken together, appear general. However, I argue that this strategy is sure to fail in the near term; it will not lead to human-level AGI that can, e.g., perform sensorimotor reasoning, motion planning, and social coordination. Instead of trying to glue modalities together into a patchwork AGI, we should pursue approaches to intelligence that treat embodiment and interaction with the environment as primary, and that view modality-centered processing as an emergent phenomenon.
Preface: Disembodied definitions of Artificial General Intelligence — emphasis on general — exclude crucial problem spaces that we should expect AGI to be able to solve. A true AGI must be general across all domains. Any complete definition must at least include the ability to solve problems that originate in physical reality, e.g. repairing a car, untying a knot, preparing food, etc. As I will discuss in the next section, what is needed for these problems is a form of intelligence that is fundamentally situated in something like a physical world model. For more discussion on this, look out for Designing an Intelligence, edited by George Konidaris (MIT Press, forthcoming).
Why We Need the World, and How LLMs Pretend to Understand It
TLDR: I first argue that true AGI needs a physical understanding of the world, as many problems cannot be converted into a problem of symbol manipulation. It has been suggested by some that LLMs are learning a model of the world through next-token prediction, but it is more likely that LLMs are learning bags of heuristics to predict tokens. This leaves them with a superficial understanding of reality and contributes to false impressions of their intelligence.
The most shocking result of the predict-next-token objective is that it yields AI models that reflect a deeply human-like understanding of the world, despite having never observed it like we have. This result has led to confusion about what it means to understand language and even to understand the world — something we have long believed to be a prerequisite for language understanding.
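For concreteness, here is a minimal sketch of what the predict-next-token objective amounts to; the toy vocabulary and the uniform stand-in model are my own illustrative assumptions, while a real LLM is a transformer trained by gradient descent to drive down this same average cross-entropy over web-scale text.

```python
# Minimal sketch of the predict-next-token objective (assumed toy vocabulary
# and a uniform stand-in "model"; real LLMs optimize this loss at vast scale).
import math

VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

def toy_model(prefix):
    """Stand-in for an LLM: returns a probability for every candidate next token."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def next_token_loss(tokens):
    """Average cross-entropy of predicting each token from its prefix.
    This is the entire training signal: nothing about the world is retained
    unless it helps this one prediction."""
    total = 0.0
    for i in range(1, len(tokens)):
        probs = toy_model(tokens[:i])
        total += -math.log(probs[tokens[i]])
    return total / (len(tokens) - 1)

print(next_token_loss(["the", "cat", "sat", "on", "the", "mat", "<eos>"]))
```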
One explanation for the capabilities of LLMs comes from an emerging theory suggesting that they induce models of the world through next-token prediction. Proponents of this theory cite the prowess of SOTA LLMs on various benchmarks, the convergence of large models to similar internal representations, and their favorite rendition of the idea that “language mirrors the structure of reality,” a notion that has been espoused at least by Plato, Wittgenstein, Foucault, and Eco. While I’m generally in support of digging up esoteric texts for research inspiration, I’m worried that this metaphor has been taken too literally. Do LLMs really learn implicit models of the world? How could they otherwise be so proficient at language?
One source of evidence in favor of the LLM world-modeling hypothesis is the Othello paper, wherein researchers were able to predict the board state of an Othello game from the hidden states of a transformer model trained on sequences of legal moves. However, there are many issues with generalizing these results to models of natural language. For one, whereas Othello moves can provably be used to deduce the full state of an Othello board, we have no reason to believe that a complete picture of the physical world can be inferred from a linguistic description.
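As a rough illustration of the probing methodology behind that result, the sketch below fits one linear probe per board square. The activations and labels here are synthetic stand-ins of my own; in the actual paper the activations come from OthelloGPT and the probes recover the board far above chance, which is the evidence cited for an internal “world model.”

```python
# Hedged sketch of board-state probing (synthetic stand-in data; the real
# experiment probes OthelloGPT activations, not random vectors).
import numpy as np

rng = np.random.default_rng(0)
n_positions, d_model, n_squares = 1000, 64, 60   # 60 playable Othello squares

hidden_states = rng.normal(size=(n_positions, d_model))           # stand-in activations
board_labels = rng.integers(0, 3, size=(n_positions, n_squares))  # 0 empty, 1 black, 2 white

# One linear probe per square: fit weights mapping activations -> square state.
# If such probes recover the board far above chance, the activations encode
# something like the game state.
probe_accuracy = []
for sq in range(n_squares):
    onehot = np.eye(3)[board_labels[:, sq]]                   # (n_positions, 3)
    W, *_ = np.linalg.lstsq(hidden_states, onehot, rcond=None)
    preds = (hidden_states @ W).argmax(axis=1)
    probe_accuracy.append((preds == board_labels[:, sq]).mean())

print(f"mean probe accuracy: {np.mean(probe_accuracy):.2f}")   # ~chance here, since the data is random
```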
What sets the game of Othello apart from many tasks in the physical world is that Othello fundamentally resides in the land of symbols, and is merely implemented using physical tokens to make it easier for humans to play. A full game of Othello can be played with just pen and paper, but one can’t, e.g., sweep a floor, do dishes, or drive a car with just pen and paper. To solve such tasks, you need some physical conception of the world beyond what humans can merely say about it. Whether that conception of the world is encoded in a formal world model or, e.g., a value function is up for debate, but it is clear that there are many problems in the physical world that cannot be fully represented by a system of symbols and solved with mere symbol manipulation.
Another issue, raised in Melanie Mitchell’s recent piece and supported by this paper, is that there is evidence that generative models can score remarkably well on sequence prediction tasks while failing to learn models of the worlds that created such sequence data, e.g. by learning comprehensive sets of idiosyncratic heuristics. For example, it was pointed out in this blog post that OthelloGPT learned sequence prediction rules that don’t actually hold for all possible Othello games, like “if the token for B4 does not appear before A4 in the input string, then B4 is empty.” While one can argue that it doesn’t matter how a world model predicts the next state of the world, it should raise suspicion when that prediction reflects a better understanding of the training data than of the underlying world that produced it. This, unfortunately, is the central fault of the predict-next-token objective, which seeks only to retain information relevant to the prediction of the next token. If that can be done with something easier to learn than a world model, it likely will be.
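To make the contrast concrete, here is a toy rendering of that quoted shortcut next to what tracking the actual game state would say. The move lists are hypothetical, and the real OthelloGPT heuristics are distributed across many neurons rather than written down as a single rule like this.

```python
# Toy contrast between the quoted shortcut and a state-tracking answer
# (hypothetical move lists; the real heuristics live in distributed circuits).

def heuristic_b4_empty(moves):
    """Shortcut: report B4 as empty unless the token "B4" appears before "A4".
    This fits typical training sequences but is not a rule of Othello."""
    if "B4" not in moves:
        return True
    if "A4" not in moves:
        return False
    return moves.index("B4") > moves.index("A4")

def state_tracking_b4_empty(moves):
    """What following the game state gives you: a square is empty exactly when
    no disc has ever been placed on it (flips recolor discs, they do not add
    them), regardless of where A4 falls in the sequence."""
    return "B4" not in moves

game = ["A4", "B4"]                   # illustrative tail of a game where B4 is played after A4
print(heuristic_b4_empty(game))       # True  -> wrong: B4 is occupied
print(state_tracking_b4_empty(game))  # False -> correct
```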