Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models


Large language models (LLMs) excel at multi-step reasoning but incur substantial inference cost. We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their causal impact on reasoning tasks and uses these head-level scores to guide fine-grained weight pruning.

For each attention head, CAP estimates the expected performance degradation when that head is masked during forward passes on a small calibration set of reasoning problems. These causal scores are then converted into weight-level importance values for the corresponding projection matrices.
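
The two steps above can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the CAP implementation: `eval_fn` is a hypothetical stand-in for running the model on the calibration set with a given head mask, the conversion rule (weight magnitude scaled by the owning head's score) is one plausible choice that the text does not specify, and the projection matrix is assumed to have its rows grouped by head.

```python
import numpy as np

def causal_head_scores(eval_fn, num_heads):
    """Score each head by the metric drop observed when only that head is masked.

    eval_fn(mask) -> float is a hypothetical callback: it should return the
    calibration-set metric with head h disabled wherever mask[h] == 0.
    """
    full_mask = np.ones(num_heads)
    baseline = eval_fn(full_mask)
    scores = np.zeros(num_heads)
    for h in range(num_heads):
        mask = full_mask.copy()
        mask[h] = 0.0                      # intervene on a single head
        scores[h] = baseline - eval_fn(mask)
    return scores

def weight_importance(weight, head_scores, head_dim):
    """Assumed conversion: |W| scaled by the score of the head that owns each row."""
    per_row = np.repeat(head_scores, head_dim)   # one score per output row
    return np.abs(weight) * per_row[:, None]

def prune(weight, importance, sparsity):
    """Zero out the fraction `sparsity` of weights with the lowest importance."""
    k = int(round(sparsity * weight.size))
    pruned = weight.copy()
    if k > 0:
        lowest = np.argsort(importance, axis=None)[:k]   # flat indices
        pruned.flat[lowest] = 0.0
    return pruned

if __name__ == "__main__":
    # Toy "model": the metric is the sum of per-head contributions (made-up values).
    contributions = np.array([0.5, 0.1, 0.3, 0.0])
    scores = causal_head_scores(lambda m: float(m @ contributions), num_heads=4)

    rng = np.random.default_rng(0)
    W = rng.uniform(0.5, 1.5, size=(8, 3))       # 4 heads x head_dim 2 rows
    W_pruned = prune(W, weight_importance(W, scores, head_dim=2), sparsity=0.25)
    print("head scores:", scores)
    print("zeroed weights:", int(np.count_nonzero(W_pruned == 0)))
```

In this toy setting the head with zero causal score loses its weights first, regardless of their magnitude, which is the behavior that distinguishes an interventional score from a magnitude-only criterion.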

Unlike magnitude-only or activation-based criteria, CAP’s interventional measurement directly captures each head’s functional contribution, yielding relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity.
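
For reference, a relative accuracy gain is conventionally computed as below. The text does not define the quantity, so this reading is an assumption, and the symbols $a_{\mathrm{CAP}}$ and $a_{\mathrm{Wanda}}$ (accuracy of each pruned model on the same benchmark at the same sparsity) are introduced here for illustration only.

```latex
\[
  \text{relative gain}
  = \frac{a_{\mathrm{CAP}} - a_{\mathrm{Wanda}}}{a_{\mathrm{Wanda}}}
\]
```

Under this definition, a relative gain of 61% means that the CAP-pruned model's accuracy is 1.61 times that of the Wanda-pruned model, not that it is 61 percentage points higher.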

We evaluate CAP on GSM8K, StrategyQA, and ARC-Challenge using Llama-3-8B-Instruct and Mistral-7B-Instruct at 10%, 20%, and 50% sparsity. At moderate sparsity (10-20%), CAP improves over Wanda in most model-benchmark configurations, with especially large gains on ARC-Challenge for Llama-3.

Our results suggest that attention-head-level causal attribution can preserve reasoning performance on downstream benchmarks better than correlational pruning criteria at equivalent sparsity, while remaining limited by coarse MLP attribution at 50% sparsity.