p-e-w / heretic

Heretic: Fully automatic censorship removal for language models

Heretic is a tool that removes censorship (aka “safety alignment”) from transformer-based language models without expensive post-training. It combines an advanced implementation of directional ablation, also known as “abliteration” (Arditi et al. 2024, Lai 2025 (1, 2)), with a TPE-based parameter optimizer powered by Optuna.
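As a rough illustration of the idea behind directional ablation, the sketch below removes the component of a vector that lies along a given direction. It is a minimal, dependency-free toy and not Heretic's implementation: the function names, the `weight` parameter, and the example vectors are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def ablate_direction(vector, direction, weight=1.0):
    """Subtract `weight` times the projection of `vector` onto `direction`.

    With weight=1.0 the result is orthogonal to `direction`.
    `weight` is a hypothetical knob, included only for illustration.
    """
    unit = normalize(direction)
    projection = dot(vector, unit)
    return [a - weight * projection * b for a, b in zip(vector, unit)]


# Hypothetical 3-dimensional example; real hidden states have thousands of dimensions.
refusal_direction = [1.0, 1.0, 0.0]
hidden_state = [2.0, 0.0, 3.0]

ablated = ablate_direction(hidden_state, refusal_direction)
residual_component = dot(ablated, normalize(refusal_direction))
print(ablated, residual_component)
```

After full ablation, the remaining component along the chosen direction is zero up to floating-point error, while the part of the vector orthogonal to that direction is unchanged.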


This approach enables Heretic to work completely automatically. Heretic finds high-quality abliteration parameters by co-minimizing the number of refusals and the KL divergence from the original model. The result is a decensored model that retains as much of the original model’s intelligence as possible.
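To make the second objective concrete, here is a small sketch of how a KL divergence between two next-token distributions can be computed from logits. The logit values are made up, and the direction of the divergence (original relative to modified) is an assumption for illustration; the sketch does not claim to match Heretic's internal computation.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Hypothetical logits over a 3-token vocabulary for the same "harmless" prompt.
original_logits = [2.0, 1.0, 0.1]
modified_logits = [1.8, 1.1, 0.3]

p_original = softmax(original_logits)
p_modified = softmax(modified_logits)

unchanged = kl_divergence(p_original, p_original)
shifted = kl_divergence(p_original, p_modified)
print(unchanged, shifted)
```

A model that was not modified at all scores zero on this metric, and the value grows as the modified model's predictions drift away from the original's.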

Using Heretic does not require an understanding of transformer internals. In fact, anyone who knows how to run a command-line program can use Heretic to decensor language models. Heretic supports most dense models, including many multimodal models, several different MoE architectures, and even some hybrid models like Qwen3.5. Pure state-space models and certain other research architectures are not yet supported out of the box.

Running unsupervised with the default configuration, Heretic can produce decensored models that rival the quality of abliterations created manually by human experts:

| Model | Refusals for “harmful” prompts | KL divergence from original model for “harmless” prompts |
| --- | --- | --- |
| google/gemma-3-12b-it (original) | 97/100 | 0 (by definition) |
| mlabonne/gemma-3-12b-it-abliterated-v2 | 3/100 | 1.04 |
| huihui-ai/gemma-3-12b-it-abliterated | 3/100 | 0.45 |
| p-e-w/gemma-3-12b-it-heretic (ours) | 3/100 | 0.16 |

The Heretic version, generated without any human effort, achieves the same level of refusal suppression as other abliterations, but at a much lower KL divergence, indicating less damage to the original model’s capabilities. (You can reproduce these numbers using Heretic’s built-in evaluation functionality, e.g. heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic. Note that the exact values might be platform- and hardware-dependent. The table above was compiled using PyTorch 2.8 on an RTX 5090.)
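One way to read the comparison above is in terms of Pareto dominance: a model dominates another if it is no worse on both metrics and strictly better on at least one. The sketch below applies that definition to the four rows of the table. The `Result` type and helper functions are hypothetical names introduced for this example, not part of Heretic.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    refusals: int          # refusals out of 100 "harmful" prompts
    kl_divergence: float   # KL divergence on "harmless" prompts


def dominates(a, b):
    """True if `a` is at least as good as `b` on both metrics and better on one."""
    no_worse = a.refusals <= b.refusals and a.kl_divergence <= b.kl_divergence
    strictly_better = a.refusals < b.refusals or a.kl_divergence < b.kl_divergence
    return no_worse and strictly_better


def pareto_front(results):
    """Keep only the results that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Values copied from the table above.
results = [
    Result("google/gemma-3-12b-it (original)", 97, 0.0),
    Result("mlabonne/gemma-3-12b-it-abliterated-v2", 3, 1.04),
    Result("huihui-ai/gemma-3-12b-it-abliterated", 3, 0.45),
    Result("p-e-w/gemma-3-12b-it-heretic", 3, 0.16),
]

front = pareto_front(results)
for r in front:
    print(r.name, r.refusals, r.kl_divergence)
```

On these numbers, only the original model (zero divergence, many refusals) and the Heretic model remain on the front; the other two abliterations have the same refusal count as the Heretic model at a higher KL divergence.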

Of course, mathematical metrics and automated benchmarks never tell the whole story, and are no substitute for human evaluation. Models generated with Heretic have been well received by users (links and emphasis added):

“I was skeptical before, but I just downloaded GPT-OSS 20B Heretic model and holy shit. It gives properly formatted long responses to sensitive topics, using the exact uncensored words that you would expect from an uncensored model, produces markdown format tables with details and whatnot. Looks like this is the best abliterated version of this model so far…”

“Heretic GPT 20b seems to be the best uncensored model I have tried yet. It doesn’t destroy a the model’s intelligence and it is answering prompts normally would be rejected by the base model.”

“[Qwen3-4B-Instruct-2507-heretic] Has been the best unquantized abliterated model that I have been able to run on 16gb vram.”

Heretic models have also been independently benchmarked using standard metrics like MMLU and GSM8K, and have been found to compare favorably with models produced by competing abliteration tools. The community has created and published well over 3000 models with Heretic.

Usage


Prepare a Python 3.10+ environment with PyTorch 2.2+ installed as appropriate for your hardware. Then run:

pip install -U heretic-llm
heretic Qwen/Qwen3-4B-Instruct-2507

Replace Qwen/Qwen3-4B-Instruct-2507 with whatever model you want to decensor.

Important

While PyTorch 2.2 is the minimum version needed for Heretic to work, some models and configurations might require features only found in later versions. For example, loading MXFP4-quantized models like gpt-oss uses torch.accelerator, which was added in PyTorch 2.6.
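If you are unsure which case applies to your environment, a quick check along the following lines may help. This is only a sketch: `version_tuple` and `meets_minimum` are hypothetical helpers written for this example, and the PyTorch part is skipped when PyTorch is not installed.

```python
import re


def version_tuple(version):
    """Parse the leading 'major.minor' of a version string such as '2.8.0+cu128'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version, minimum):
    """Compare versions numerically, so that '2.10' counts as newer than '2.6'."""
    return version_tuple(version) >= version_tuple(minimum)


try:
    import torch
except ImportError:  # PyTorch is not installed in this environment
    torch = None

if torch is not None:
    print("PyTorch version:", torch.__version__)
    print("Meets the 2.2 minimum:", meets_minimum(torch.__version__, "2.2"))
    # torch.accelerator is the feature mentioned above for MXFP4 model loading.
    print("Has torch.accelerator:", hasattr(torch, "accelerator"))
```

Checking for the attribute directly, rather than relying only on the version number, also covers builds with unusual version strings.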

Tip

Heretic uses uv for dependency management, and the repository includes a uv.lock file pinning every package version. If you already use uv (and you probably should!), you can just clone the repo and run Heretic with uv run heretic, which ensures that your dependencies match those used by the developers, improving reliability and security.

The process is fully automatic and does not require configuration; however, Heretic has a variety of configuration parameters that can be changed for greater control. Run heretic --help to see available command-line options, or look at config.default.toml if you prefer to use a configuration file.

At the start of a program run, Heretic benchmarks the system to determine the batch size that makes the most of the available hardware. On an RTX 3090, with the default configuration, decensoring Qwen3-4B-Instruct-2507 takes about 20-30 minutes. Note that Heretic supports model quantization with bitsandbytes, which can drastically reduce the amount of VRAM required to process models. Set the quantization option to bnb_4bit to enable quantization.
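For a sense of scale, the back-of-the-envelope calculation below estimates the memory needed for model weights alone at 16 and 4 bits per parameter. It assumes roughly 4 billion parameters (inferred from the model name) and a 16-bit unquantized baseline, and it ignores quantization metadata, activations, and other runtime overhead, so actual VRAM usage will be higher.

```python
def weight_memory_gib(num_parameters, bits_per_parameter):
    """Memory for the weights alone, in GiB (2**30 bytes)."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 2**30


# Assumption: about 4 billion parameters, as the "4B" in the model name suggests.
num_parameters = 4e9

unquantized = weight_memory_gib(num_parameters, 16)  # 16-bit weights
quantized = weight_memory_gib(num_parameters, 4)     # 4-bit weights

print(f"16-bit weights: {unquantized:.2f} GiB")  # 7.45 GiB
print(f" 4-bit weights: {quantized:.2f} GiB")    # 1.86 GiB
```

Under these assumptions, 4-bit quantization shrinks the weight footprint by a factor of four.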

After Heretic has finished decensoring a model, you are given the option to save the model, upload it to Hugging Face, chat with it to test how well it works, run standard benchmarks on it, or any combination of those actions.

Research features


In addition to its primary function of removing model censorship, Heretic also provides features designed to support research into the semantics of model internals (interpretability). To use those features, you need to install Heretic with the optional research extra:

pip install -U heretic-llm[research]

This gives you access to the following functionality:

Generate plots of residual vectors by passing --plot-residuals. When run with this flag, Heretic will:

  • Compute residual vectors (hidden states) for the first output token, for each transformer layer, for both “harmful” and “harmless” prompts.
  • Perform a PaCMAP projection from residual space to 2D-space.
  • Left-right align the projections of “harmful”/“harmless” residuals by their geometric medians.