Beyond LoRA: Is Sparsity-Induced Adaptation Better?
Abstract: Low-rank adaptation (LoRA) and its variants provide a memory- and compute-efficient alternative to full fine-tuning of pre-trained models. However, questions remain about the comparative generalizability of these approaches and how the structural restrictions on low-rank updates preserve effective adaptation performance.
We present a historical framing that covers the past (full fine-tuning and original LoRA) and the present (different variants of LoRA), and we propose simpler, cheaper, parameter-efficient extensions that induce sparsity within existing LoRA variants: Cheap LoRA (cLA), which trains a single low-rank factor while keeping the other fixed (deterministically or, in its randomized variant, stochastically), and its chained circulant variant, ${c}^3$LA.
We frame cLA as a structured instance of asymmetric LoRA, serving as a controlled column-subspace restriction of full fine-tuning. We derive information-theoretic generalization error bounds for these variants, marking one of the first endeavors in this area.
Empirically, we evaluate 11 fine-tuning methods across 10 pre-trained models and 14 datasets, analyzing the fine-tuned models’ performance and generalization using tools such as loss landscapes and spectral analysis.
Although fine-tuned models are sensitive to the pre-trained model, the datasets, and other factors, our study suggests that restricting the adaptation of LoRA-based PEFT methods to a sparse, structured column space remains competitive with parameter-matched baselines across tasks, while reducing training time by up to 10% and peak GPU memory by up to 15%, even with a naïve, non-optimized sparse implementation.
Our theoretical and empirical generalization measures provide a more consistent and principled approach to cost-effective adaptation than commonly used analytical tools.
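To make the core idea of cLA concrete, the following is a minimal NumPy sketch of a low-rank update $\Delta W = BA$ in which only one factor is trained. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which factor is frozen, so the sketch assumes $B$ is fixed (randomly initialized) and $A$ is trained, which keeps every update inside the column space of $B$. The function names, the squared-error loss, and the plain gradient-descent step are all hypothetical choices made for the example.

```python
import numpy as np


def make_cla_adapter(d_out, d_in, rank, seed=0):
    """Create the two low-rank factors (hypothetical helper).

    Assumption: B is the frozen factor and A is the trainable one.
    A starts at zero so the initial update B @ A is zero.
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((d_out, rank)) / np.sqrt(rank)  # frozen
    A = np.zeros((rank, d_in))                              # trainable
    return B, A


def forward(W, B, A, x):
    """Apply the adapted weight W + B @ A to a batch x of shape (n, d_in)."""
    return x @ (W + B @ A).T


def loss(W, B, A, x, y):
    """Mean over the batch of 0.5 * squared error (illustrative loss)."""
    err = forward(W, B, A, x) - y
    return 0.5 * np.sum(err ** 2) / len(x)


def grad_step_A(W, B, A, x, y, lr=1e-2):
    """One gradient-descent step on A only; W and B stay fixed."""
    err = forward(W, B, A, x) - y          # (n, d_out)
    grad_delta = err.T @ x / len(x)        # dL/d(Delta W), (d_out, d_in)
    grad_A = B.T @ grad_delta              # chain rule through Delta W = B @ A
    return A - lr * grad_A


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d_out, d_in, rank = 6, 5, 2
    W = rng.standard_normal((d_out, d_in))  # stands in for a pre-trained weight
    B, A = make_cla_adapter(d_out, d_in, rank)
    x = rng.standard_normal((32, d_in))
    y = rng.standard_normal((32, d_out))

    A = grad_step_A(W, B, A, x, y)
    print("trainable parameters (A only):", A.size)
    print("trainable parameters (both factors):", rank * (d_in + d_out))
```

With one factor frozen, the trainable parameter count drops from $r(d_{\text{in}} + d_{\text{out}})$ to $r \cdot d_{\text{in}}$ in this sketch, and the learned update $BA$ can only move the weight within the span of the columns of $B$.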