What's Missing From LLM Chatbots: A Sense of Purpose

LLM-based chatbots’ capabilities have been advancing every month. These improvements are mostly measured by benchmarks like MMLU, HumanEval, and MATH (e.g., Sonnet 3.5, GPT-4o). However, as these measures get more and more saturated, is user experience improving in proportion to the scores? If we envision a future of human-AI collaboration, rather than AI replacing humans, the current ways of measuring dialogue systems may be insufficient, because they evaluate models in a non-interactive fashion.

Why does purposeful dialogue matter? Purposeful dialogue refers to a multi-round user-chatbot conversation that centers on a goal or intention. The goal could range from a generic one like “harmless and helpful” to a more specific role like “travel planning agent,” “psychotherapist,” or “customer service bot.”

Travel planning is a simple, illustrative example. Our own preferences, our fellow travelers’ preferences, and all the complexities of real-world situations make transmitting every piece of information in one pass prohibitively costly. If multiple back-and-forth exchanges are allowed, however, only the important information needs to be exchanged, selectively. Negotiation theory offers an analogy: iterative bargaining yields better outcomes than a take-it-or-leave-it offer. In fact, sharing information is only one aspect of dialogue. In Terry Winograd’s words: “All language use can be thought of as a way of activating procedures within the hearer.” We can think of each utterance as a deliberate action that one party takes to alter the world model of the other.

What if both parties have more complicated, even hidden, goals? Purposeful dialogue then gives us a way of formulating human-AI interaction as a collaborative game, in which the chatbot’s goal is to help the human achieve theirs. This might seem like unnecessary complexity that only academics care about. However, purposeful dialogue could be beneficial even for the most hard-nosed, product-oriented research directions, such as code generation. Existing coding benchmarks mostly measure performance in a one-pass generation setting; however, for AI to automatically resolve ordinary GitHub issues (as in SWE-bench), a single action is unlikely to be enough: the AI needs to communicate back and forth with human software engineers to make sure it understands the requirements correctly, to ask for missing documentation and data, and even to ask humans for a hand when needed. In a similar vein to pair programming, this could reduce code defects without the burden of extra man-hours.

Moreover, with the introduction of turn-taking, many new possibilities are unlocked. As interactions become long-term and memory accumulates, the chatbot can gradually update its profile of the user and adapt to their preferences. Imagine a personal assistant (e.g., IVA, Siri) that, through daily interaction, learns your preferences and intentions. It can automatically read your sources of new information (e.g., Twitter, arXiv, Slack, NYT) and provide a morning news summary tailored to your preferences. It can draft emails for you and keep improving by learning from your edits.

In a nutshell, meaningful interactions between people rarely begin between complete strangers and conclude in just one exchange. Humans naturally interact with each other through multi-round dialogues and adapt accordingly throughout the conversation. But doesn’t that seem like exactly the opposite of predicting the next token, the cornerstone of modern LLMs? Below, let’s take a look at the makings of dialogue systems.

How were, and are, dialogue systems made? Let’s jump back to the 1970s, when Roger Schank introduced his “restaurant script” as a kind of dialogue system [1]. This script breaks the typical restaurant experience down into steps like entering, ordering, eating, and paying, each with specific scripted utterances. Back then, every piece of dialogue in these scenarios was carefully planned out, enabling AI systems to mimic realistic conversations. ELIZA, a Rogerian-psychotherapist simulator, and PARRY, a system mimicking a paranoid individual, were two other early dialogue systems from the era before machine learning took over.

Comparing this approach to today’s LLM-based dialogue systems, it seems mysterious how models trained merely to predict the next token could engage in dialogue at all. So let’s take a close look at how modern dialogue systems are made, with an emphasis on how the dialogue format comes into play:

(1) Pretraining: a sequence model is trained to predict the next token on a gigantic corpus of mixed internet text. The composition varies, but it is predominantly news, books, and GitHub code, with a small blend of forum-crawled data (e.g., from Reddit and Stack Exchange) that may contain dialogue-like data.
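For concreteness, here is a minimal toy sketch of that next-token objective in PyTorch (the TinyLM model, its sizes, and the random batch are illustrative stand-ins, not anything from a real pretraining setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """A toy next-token predictor: embedding -> GRU -> vocabulary logits."""
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, ids):
        hidden, _ = self.rnn(self.embed(ids))
        return self.head(hidden)

model = TinyLM()
batch = torch.randint(0, 100, (4, 16))   # 4 "documents" of 16 token ids each
logits = model(batch[:, :-1])            # predict token t+1 from tokens <= t
loss = F.cross_entropy(
    logits.reshape(-1, 100),             # flatten (batch, time) for the loss
    batch[:, 1:].reshape(-1),            # targets: the same ids shifted by one
)
loss.backward()                          # one gradient step of "pretraining"
```

Real pretraining applies this same objective over trillions of tokens; dialogue only enters the picture through whatever forum data happens to be in the corpus.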

(2) Introduce dialogue formatting: because the sequence model only processes strings, while the most natural representation of a dialogue history is a structured record of the system prompt and past exchanges, some formatting must be introduced to convert between the two. For users’ convenience, some Hugging Face tokenizers provide a method called tokenizer.apply_chat_template for exactly this purpose. The exact formatting differs from model to model, but it usually involves guarding the system prompt with special markers such as <system> or [INST], in the hope that the pretrained model will allocate more attention weight to it. The system prompt plays a significant role in adapting language models to downstream applications and in ensuring their safe behavior. Notably, the choice of format is arbitrary at this step: the pretraining corpus doesn’t follow any such format.
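As a concrete illustration, here is roughly how that conversion looks with the Hugging Face transformers library (the checkpoint name is only an example; any chat-tuned model that ships a template would do, and each emits its own markers):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful travel-planning agent."},
    {"role": "user", "content": "Plan a three-day trip to Kyoto."},
]

# Flatten the structured dialogue history into the single string the
# sequence model actually consumes, with model-specific guard markers.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,             # return the formatted string, not token ids
    add_generation_prompt=True, # append the marker that cues the assistant's turn
)
print(prompt)
```

For this checkpoint the output wraps each turn in markers like <|system|> and <|user|>; a Llama-style template would use [INST] instead, which is exactly the arbitrariness noted above.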

(3) RLHF: In this step, the chatbot is directly rewarded or penalized for generating desired or undesired answers. It’s worth noting that this is the first time the int…
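One standard way of implementing that reward signal is to first train a reward model on pairwise human preferences. Below is a minimal sketch of the usual Bradley-Terry-style objective, with made-up stand-in scores in place of a real reward model’s outputs:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected): pushes desired answers
    to score higher than undesired ones."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Stand-in scalar scores a reward model might assign to two candidate
# answers for each of three prompts.
r_chosen = torch.tensor([1.2, 0.3, 0.8])      # human-preferred answers
r_rejected = torch.tensor([0.4, 0.9, -0.1])   # dispreferred answers
print(preference_loss(r_chosen, r_rejected))  # shrinks as the score gap widens
```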
