This startup’s new mechanistic interpretability tool lets you debug LLMs
The San Francisco–based startup Goodfire just released a new tool, called Silico, that lets researchers and engineers peer inside an AI model and adjust its parameters—the settings that determine a model’s behavior—during training. This could give model makers more fine-grained control over how this technology is built than was once thought possible.
Goodfire claims Silico is the first off-the-shelf tool of its kind that can help developers debug all stages of the development process, from building a data set to training a model. The company says its mission is to make building AI models less like alchemy and more like a science.
Sure, LLMs like ChatGPT and Gemini can do amazing things. But nobody knows exactly how or why they work, and that can make it hard to fix their flaws or block unwanted behaviors. “We saw this widening gap between how well models were understood and just how widely they were being deployed,” Goodfire’s CEO, Eric Ho, tells MIT Technology Review in an exclusive chat ahead of Silico’s release. “I think the dominant feeling in every single major frontier lab today is that you just need more scale, more compute, more data, and then you get AGI [artificial general intelligence] and nothing else matters. And we’re saying no, there’s a better way.”
Goodfire is one of a small handful of companies, including industry leaders Anthropic, OpenAI, and Google DeepMind, pioneering a technique known as mechanistic interpretability, which aims to understand what goes on inside an AI model when it carries out a task by mapping its neurons and the pathways between them. (MIT Technology Review picked mechanistic interpretability as one of its 10 Breakthrough Technologies of 2026.)
Goodfire wants to use this approach not only to audit models—that is, studying those that have already been trained—but to help design them in the first place. “We want to remove the trial and error and turn training models into precision engineering,” says Ho. “And that means exposing the knobs and dials so that you can actually use them during the training process.”
Goodfire has already used its techniques and tools to tweak the behaviors of LLMs—for example, reducing the number of hallucinations they produce. With Silico, the company is now packaging up many of those in-house techniques and shipping them as a product. The tool uses agents to automate much of the complex work. “Agents are now strong enough to do a lot of the interpretability work that we were doing using humans,” says Ho. “That was kind of the gap that needed to be bridged before this was actually a viable platform that customers could use themselves.”
Leonard Bereska, a researcher at the University of Amsterdam who has worked on mechanistic interpretability, thinks Silico looks like a useful tool. But he pushes back on Goodfire’s loftier aspirations. “In reality, they are adding precision to the alchemy,” he says. “Calling it engineering makes it sound more principled than it is.”
Mapping models
Silico lets you zoom in on specific parts of a trained model, such as individual neurons or groups of neurons, and run experiments to see what those neurons do. (Assuming you have access to the model’s inner workings. Most people won’t be able to use Silico to poke around inside ChatGPT or Gemini, but you can use it to look at the parameters inside many open-source models.) You can then check what inputs make different neurons fire, and trace pathways upstream and downstream of a neuron to see how other neurons affect it and how it affects other neurons in turn.
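The workflow described above can be illustrated on a toy network. This is a minimal sketch of the general idea, not Silico’s actual API: the network, weights, and probe inputs are all hypothetical, and real interpretability tools hook into transformer layers rather than a hand-built two-layer model.

```python
# Toy sketch of neuron-level inspection: probe which inputs make a
# neuron fire, then read its upstream and downstream connections.
# Everything here is illustrative, not Goodfire's actual tooling.

def relu(x):
    return max(0.0, x)

# A tiny network: 3 inputs -> 2 hidden neurons -> 1 output.
W1 = [[1.0, -1.0, 0.5],   # weights into hidden neuron 0
      [0.0,  2.0, -0.5]]  # weights into hidden neuron 1
W2 = [0.8, -1.2]          # weights from hidden neurons to the output

def forward(x):
    """Run the network, returning hidden activations and the output."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    output = sum(w * h for w, h in zip(W2, hidden))
    return hidden, output

# "What inputs make neuron 1 fire?" -- probe a few candidate inputs.
probes = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.0, 0.0, 1.0]}
firing = {name: forward(x)[0][1] for name, x in probes.items()}
print(firing)  # neuron 1 fires only for input "b"

# Upstream: which weights feed neuron 1?  Downstream: how does it
# affect the output?  In this toy case, both are read off the weights.
print("upstream weights:", W1[1])
print("downstream weight:", W2[1])
```

In a real model the same questions are answered empirically, by running many inputs through the network and recording activations, rather than by eyeballing a weight matrix.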
For example, Goodfire found one neuron inside the open-source model Qwen 3 that was associated with the so-called trolley problem. Activating this neuron changed the model’s responses, making it frame its outputs as explicit moral dilemmas. “When this neuron’s active, all sorts of weird things happen,” says Ho.
Pinpointing the source of odd behavior like this is now pretty standard practice. But Goodfire wants to make it easier to adjust that behavior. Using Silico, developers can now adjust the parameters connected to individual neurons to boost or suppress certain behaviors. In another example, Goodfire researchers asked a model whether a company should disclose that its AI behaves deceptively in 0.3% of cases, affecting 200 million users. The model said no, citing the negative business impact of such a disclosure. By looking inside the model, the researchers found that boosting neurons that were found to be associated with transparency and disclosure flipped the answer from no to yes nine out of 10 times. “The model already had the ethical reasoning circuitry, but it was being outweighed by the commercial risk assessment,” says Ho.
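The boosting intervention described above can be sketched in miniature. This is a hypothetical illustration of the steering idea, not Goodfire’s method: the neuron indices, readout weights, and activation values are all invented to show how amplifying “transparency” neurons can flip a binary decision dominated by a “commercial risk” neuron.

```python
# Illustrative sketch of activation steering: boost selected neurons
# before the readout and watch a yes/no decision flip. All names and
# numbers are hypothetical.

def decide(hidden, steer=None, strength=0.0):
    """Score a yes/no decision from hidden activations.

    `steer` is an optional list of neuron indices whose activations are
    boosted by `strength`, mimicking an intervention on a trained model.
    """
    hidden = list(hidden)
    if steer:
        for i in steer:
            hidden[i] += strength
    # Readout weights: positive weights favor "yes" (disclose).
    readout = [0.5, -2.0, 0.9]
    score = sum(w * h for w, h in zip(readout, hidden))
    return "yes" if score > 0 else "no"

# Unsteered: the "commercial risk" neuron (index 1) dominates -> "no".
activations = [1.0, 1.5, 0.5]
print(decide(activations))  # -> no

# Boost the "transparency" neurons (indices 0 and 2) -> answer flips.
print(decide(activations, steer=[0, 2], strength=2.0))  # -> yes
```

The nine-out-of-ten flip rate Goodfire reports reflects that, in a real model, the same intervention interacts with many other circuits and does not succeed deterministically.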
Tweaking the values of a model in this way is just one approach. Silico can also help steer the training process by filtering out certain training data to avoid setting unwanted values for certain parameters in the first place. For example, many models will tell you that 9.11 is greater than 9.9. Looking inside a model to see what’s going on might reveal that it is being influenced by neurons associated with the Bible, in which verse 9.9 comes before 9.11, or by code repositories where consecutive updates are numbered 9.9, 9.10, 9.11 and so on. Using this information, the model can be retrained to make it avoid its “Bible” neurons when doing math.
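The data-filtering step can be sketched as follows. This is a simplified stand-in, not Goodfire’s pipeline: a real system would score each example by how strongly it drives the unwanted neurons, whereas here a keyword check plays that role for the verse/version-number confusion described above.

```python
# Illustrative sketch (not Goodfire's actual pipeline) of filtering
# training data that reinforces an unwanted association -- here,
# examples that treat "9.11" as a verse or version number rather
# than a decimal.

def activates_unwanted_neuron(example):
    """Stand-in for a real attribution check: flag examples where the
    number appears in a verse/version context rather than a numeric one."""
    context_markers = ("verse", "chapter", "release", "v9.11")
    return any(marker in example.lower() for marker in context_markers)

corpus = [
    "Matthew verse 9.11 follows verse 9.9",
    "Release v9.11 shipped after v9.10",
    "9.11 is less than 9.9 as decimal numbers",
    "The decimal 9.9 is greater than 9.11",
]

# Keep only examples that won't strengthen the verse/version circuit,
# then retrain on the filtered corpus.
filtered = [ex for ex in corpus if not activates_unwanted_neuron(ex)]
print(filtered)
```

The point of interpretability here is knowing *which* data to filter: without tracing the bug back to the “Bible” and version-number neurons, there would be no principled criterion for the filter.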
By releasing Silico, Goodfire wants to put techniques previously available to a few top labs into the hands of smaller firms and research teams that want to build their own model or adapt an open-source one. The tool will be available for a fee determined on a case-by-case basis according to customers’ requirements (Goodfire declined to give specific pricing details). “If we can make training models a lot more like building software, there’s no reason why there can’t be many more companies designing models that fit their needs,” says Ho. Bereska agrees that tools like Silico could help firms build more trustworthy models.