LLMs are stuck in a groupthink groove. This startup is trying to get them out.
EXECUTIVE SUMMARY
Let’s start with a game. Open up your chatbot of choice (Claude, ChatGPT, or Gemini) and type “Give me a random number between 1 and 10.” You’re going to get 7. Almost always. Now type “Another” and you’ll get 3 or 4. Type “Another” again and you’ll get 8 or 9. That won’t work every time, but if it did for you, you may wonder if I have superpowers. I don’t. The truth is that most large language models are stuck in a rut. They are far more predictable and far less creative in their responses than you might expect. That’s fine for tasks like coding or research, but groupthink is a problem when you’re brainstorming or planning your next vacation.
The Australian startup Springboards has a solution. It built an LLM called Flint, which has been trained to come up with a wider variety of responses than mainstream LLMs to open-ended questions such as “Where should I go in Europe?” “Most language models are fighting hallucinations,” says Springboards cofounder and CEO Pip Bingemann. “We welcome them.”
Bingemann introduced me to the random number game when he first showed me his company’s new model. It felt like watching an illusionist with a deck of cards. “This is our sales trick, and it works every single time,” he says. After ChatGPT and Claude both gave their 7s, Bingemann turned to Flint. It too came back with 7: “Aha, of course that was going to happen, but it’s okay—7 is a legitimate answer.” He restarted the session and prompted again: ChatGPT gave 7, Claude gave 7, Flint gave 3.7916.
Run your way
It’s not just numbers. When Bingemann asked ChatGPT and Claude to name a type of car, he predicted that it would be a Toyota or a Honda, and he was right. Flint came up with a Ford F-150. “There’s all this lost information that doesn’t get served up in these models,” he says. “They’re just as capable of saying a Buick or a Tesla. They just don’t; they’re biased.”
Bingemann sent one last prompt to each of the three models: “Give me a tagline for a campaign for New Balance running shoes. Just the tagline.” Claude: “Run your way.” ChatGPT: “Run your way.” Flint: “Built to last, run to win.” It won’t win any awards, but at least it’s different.
This weird limitation of LLMs is starting to get more attention. In November a team of researchers put out a paper, titled “Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond),” that exposed a remarkable degree of repetition not only in the answers from individual LLMs but between them as well. They found that different LLMs converged on very similar answers when prompted with open-ended questions. It’s not clear exactly why this happens, but the researchers speculate it’s because most LLMs today are trained in similar ways on similar data to do similar tasks. The team won the best paper award at NeurIPS, a major AI conference.
When the researchers asked 25 different LLMs (including models from the top US firms as well as open-source models from China and elsewhere) 50 times each to write a metaphor about time, most of the 1,250 responses were a version of “Time is a river” or “Time is a weaver.” (I asked some of my colleagues the same question and six people gave me six different answers. My highlight: “Time is a favorite sweatshirt, shaped by a lifetime of wear.”)
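To make the idea of homogeneity concrete, here is a minimal sketch of one way to quantify it: count what share of a batch of responses is taken up by the most common answers. This is not the paper's method. The function names (`normalize`, `concentration`) and both toy lists are made up for illustration and are not results from the study.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivially different answers match."""
    kept = (ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join("".join(kept).split())


def concentration(responses: list[str], top_k: int = 2) -> float:
    """Share of all responses accounted for by the top_k most common answers."""
    counts = Counter(normalize(r) for r in responses)
    top = sum(n for _, n in counts.most_common(top_k))
    return top / len(responses)


# Hypothetical toy data (NOT from the paper): a repetitive batch of answers...
model_like = ["Time is a river."] * 6 + ["Time is a weaver"] * 3 + ["Time is a thief."]

# ...and a varied batch in which every answer is different.
human_like = [
    "Time is a favorite sweatshirt.",
    "Time is a slow leak.",
    "Time is a crowded train.",
    "Time is a borrowed book.",
    "Time is a dripping tap.",
    "Time is a long hallway.",
]

print(concentration(model_like))  # 9 of 10 answers fall in the top two
print(concentration(human_like))  # 2 of 6 answers fall in the top two
```

On this toy data the first figure is 0.9 and the second is about 0.33. Real responses would need fuzzier matching than exact string comparison, since variations on the same metaphor rarely repeat word for word.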
When you look for it, you see repetition everywhere, says Kieran Browne, cofounder and CTO at Springboards. “The way that most chat interfaces are designed, it makes it feel like you’re having a personal conversation,” he says. “I think most people don’t really realize the extent to which they are getting the same stuff as everybody else.”
Take another example: “What should I name my band?” Most models will say something involving “glass,” “neon,” “velvet,” or “static,” says Browne. When I tried it, ChatGPT spat out a list of 56 band names. At the top was “Glass Harbor.” Skimming through, I found “Static Empire,” “Neon Hearts,” and “Velvet Echo.” I asked Gemini; it gave me 15 suggestions, including “Static Horizon.” Some of the suggestions looked pretty cool, though. ChatGPT’s “Sofa Astronauts” caught my eye, so I googled it—and found that a band called Sofa Astronauts already exists.
(OpenAI says that training models to give reliable and coherent answers can lead them to converge around familiar, high-probability responses and that pushing harder for novelty can lead to weaker or less reliable responses. It also notes that the “Artificial Hivemind” paper studied models from 2024 that have since been updated.)
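The trade-off described here can be illustrated with a toy probability distribution. The sketch below is an assumption-laden illustration, not how OpenAI or Springboards actually trains or samples its models: the scores for the numbers 1 to 10 are invented, and `temperature` is a standard sampling control that the article itself does not mention. It shows that when one answer holds most of the probability, it dominates, and that flattening the distribution buys variety by giving more weight to answers the model scored as less likely.

```python
import math


def softmax(scores: list[float], temperature: float = 1.0) -> list[float]:
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the answers 1..10, with 7 strongly favored.
scores = [0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 4.0, 1.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0, 5.0):
    probs = softmax(scores, t)
    # Index 6 holds the probability assigned to the answer "7".
    print(f"temperature={t}: P(7)={probs[6]:.2f}")
```

As the temperature rises, the probability of "7" falls toward the uniform value of 0.1, which is the sense in which more novelty means leaning on answers the model considers less likely.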
Creative catapult
Springboards has developed a tool backed by a selection of LLMs, including ChatGPT and Claude, that creative professionals in advertising or marketing can use to brainstorm ideas. The tool lets you drag around text produced by different models, picking the bits that you like and combining them into something new, at least in theory. Springboards is pitching Flint as an alternative model that users of its tool can select when looking for more variety.
Zoe Scaman, founder of the business strategy startup Bodacious and chief strategy officer at 77X, a direct-to-fan marketing platform set up by Luka Dončić of the LA Lakers, has been trying it out. “I find it really useful for throwing me in completely different directions,” she says. “I use it if I want to catapult myself all over the place.”
In one test, Scaman pitted Flint against Claude, Gemini, and ChatGPT by giving each of the models a classic MBA case study: How would you reinvent a finance company for today’s youth? The three mainstream models all went down the same path, she says: “You know, we need to teach financial literacy in a fun and funky way—well, that’s nothing new.” But Flint came up with something different, suggesting that the whole concept of wealth accumulation should get a rebrand. “That was really interesting,” says Scaman. She notes that Flint is still a prototype and doesn’t work all the time. “It sometimes falls over when you start pushing it too far,” she says. “But I think…”