Can Editing 1 Neuron Fix Repetition Loops in LLMs?
Abstract: Yes. Can it cure doom loops? Probably not. The Gemma 4 instruction-tuned models share a reproducible failure: on long factual enumeration prompts, such as listing every episode of a TV series, the 88 IAU constellations, or the 151 original Pokemon, they collapse into repetition, either a tight verbatim loop or a list whose entries decay onto a single answer. These loops occur at rates as high as 95% and survive prompt rewording, inference-engine changes, and most sampling adjustments.
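To make the "tight verbatim loop" case concrete, here is a minimal sketch of one way such a loop could be flagged in generated output. The function name `find_tail_loop` and the thresholds `max_period` and `min_repeats` are illustrative assumptions, not the detection criterion used in the paper, and the sketch does not cover the second failure mode, where list entries decay onto a single answer without repeating verbatim.

```python
def find_tail_loop(tokens, max_period=50, min_repeats=3):
    """Return (period, repeats) if `tokens` ends in a repeated block, else None.

    A block of `period` tokens counts as a loop when it occurs at least
    `min_repeats` times back to back at the very end of the sequence.
    Both thresholds are hypothetical defaults chosen for illustration.
    """
    n = len(tokens)
    for period in range(1, max_period + 1):
        if period * min_repeats > n:
            break  # sequence is too short to hold this many repeats
        block = tokens[n - period:]
        repeats = 1
        # Walk backwards one block at a time while the block keeps matching.
        while (repeats + 1) * period <= n:
            start = n - (repeats + 1) * period
            if tokens[start:start + period] != block:
                break
            repeats += 1
        if repeats >= min_repeats:
            return period, repeats
    return None


looping = "1. Orion 2. Lyra 3. Lyra 3. Lyra 3. Lyra".split()
healthy = "1. Orion 2. Lyra 3. Cygnus".split()
print(find_tail_loop(looping))  # (2, 3)
print(find_tail_loop(healthy))  # None
```

Whitespace-split words stand in for model tokens here; with a real tokenizer the same check would run over token IDs.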
In this paper we explore whether this behavior is localized enough to be removed by weight edits. To localize the cause, we use per-layer ablation and per-neuron attribution, then confirm the strongest candidates with full-generation sweeps. The loops trace to a small set of MLP neurons (or, in the 26B-A4B Mixture-of-Experts model, a few routed experts), which we suppress with static weight edits. These “surgeries” can be as small as a single sign-inverted neuron (in the E2B model).
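As a rough illustration of what a static single-neuron edit can look like, the sketch below builds a toy gated MLP in plain Python and scales the output weights of one hidden neuron. Everything in it is an assumption made for illustration: the GELU-gated layout, the helper names (`neuron_activations`, `mlp_forward`, `scale_neuron`), the toy weights, and the choice to apply the sign inversion to the neuron's output weights. The abstract does not specify where in the weights the edit is applied.

```python
import math


def gelu(x):
    # tanh approximation of GELU (assumed activation for this toy model)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def neuron_activations(x, w_gate, w_up):
    """Per-neuron hidden activation: gelu(gate_i . x) * (up_i . x)."""
    def dot(w):
        return sum(wi * xi for wi, xi in zip(w, x))
    return [gelu(dot(g)) * dot(u) for g, u in zip(w_gate, w_up)]


def mlp_forward(x, w_gate, w_up, w_down):
    """w_down[i] is the vector that neuron i writes to the layer output."""
    h = neuron_activations(x, w_gate, w_up)
    d_model = len(w_down[0])
    return [sum(h[i] * w_down[i][j] for i in range(len(h))) for j in range(d_model)]


def scale_neuron(w_down, index, factor):
    """Return a copy of w_down with one neuron's output weights scaled.

    factor=0.0 ablates the neuron; factor=-1.0 inverts its sign.
    The original weights are left untouched.
    """
    edited = [row[:] for row in w_down]
    edited[index] = [factor * v for v in edited[index]]
    return edited


# Toy weights: 3 hidden neurons, model width 2 (hypothetical values).
x = [1.0, -0.5]
w_gate = [[0.3, 0.1], [0.8, -0.4], [-0.2, 0.5]]
w_up = [[0.5, 0.2], [1.0, 0.6], [0.1, -0.3]]
w_down = [[0.2, -0.1], [0.7, 0.4], [-0.3, 0.6]]

base = mlp_forward(x, w_gate, w_up, w_down)
ablated = mlp_forward(x, w_gate, w_up, scale_neuron(w_down, 1, 0.0))
inverted = mlp_forward(x, w_gate, w_up, scale_neuron(w_down, 1, -1.0))
print(base, ablated, inverted)
```

Because the layer output is a sum of per-neuron contributions, ablating neuron 1 removes its contribution once, while inverting its sign removes it twice. That makes sign inversion a stronger intervention than ablation for the same single-neuron footprint.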
The size of the effective edits grows with model scale, but in all cases, the loop patterns can be addressed at normal generation budgets while preserving general-purpose benchmark scores. However, the edits do not solve everything: we also study longer thinking budgets, where the two larger models most visibly enter doom looping, i.e. a non-convergent regime in which the model self-corrects in circles over a fact it cannot recall, exhausting the budget without committing to a final answer.
We show this residual failure is reduced but not eliminated by the same edits, and argue it is fundamentally a knowledge-precision problem rather than a removable circuit; weight surgery can delete a loop, but it cannot supply a missing fact. Our results are both a feasibility demonstration, that is, evidence that a concrete generation pathology can be localized to a few parameters and edited out, and a delineation of where that approach stops.