StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

Abstract: Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to acquire new abilities incrementally. However, existing CVIT methods operate under a restrictive task-incremental setting, in which each training phase corresponds to a single, predefined task. This does not reflect real-world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks.

To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting in which models learn from a stream of data chunks, each containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here because they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk.
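
The StrCVIT setting can be illustrated with a toy stream generator. This is a minimal sketch, not the benchmark's construction: the task names, the schedule (one additional task becomes available per chunk), and the random mixture weights are all assumptions made for illustration.

```python
import random
from collections import Counter


def make_stream(task_names, num_chunks, chunk_size, seed=0):
    """Yield chunks of task labels whose mixture changes from chunk to chunk."""
    rng = random.Random(seed)
    for step in range(num_chunks):
        # Assumed schedule: task i first becomes available at chunk i.
        active = task_names[: min(step + 1, len(task_names))]
        # Assumed mixture: fresh random proportions for every chunk.
        weights = [rng.random() + 1e-3 for _ in active]
        yield rng.choices(active, weights=weights, k=chunk_size)


def split_new_and_recurring(chunk, seen):
    """Separate the tasks present in a chunk into unseen and previously seen."""
    present = set(chunk)
    return present - seen, present & seen


tasks = ["caption", "vqa", "ocr", "grounding"]  # hypothetical task names
seen = set()
history = []
for chunk in make_stream(tasks, num_chunks=6, chunk_size=32):
    new, recurring = split_new_and_recurring(chunk, seen)
    history.append((Counter(chunk), new, recurring))
    seen |= new

for step, (mix, new, recurring) in enumerate(history):
    print(step, dict(mix), "new:", sorted(new), "recurring:", sorted(recurring))
```

Each chunk mixes samples from several tasks, so a learner cannot assume that a training phase maps to one task. It has to handle new and recurring tasks within the same chunk.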

We therefore propose StrLoRA, a regularized two-stage expert routing framework. StrLoRA first performs task-aware expert selection using the textual instruction to activate a sparse subset of relevant experts, reducing cross-task interference. It then applies token-wise expert weighting within this subset, where contribution weights are computed via cross-modal attention between local visual tokens and the global instruction representation.
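
The two routing stages can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the paper's implementation: the abstract does not specify how scores are computed, so the dot-product scoring, the per-expert gating vectors (`expert_gates`), and the toy embeddings are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_experts(instr_vec, expert_keys, k):
    """Stage 1: choose the top-k experts using only the instruction embedding."""
    scores = [dot(instr_vec, key) for key in expert_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def token_weights(visual_tokens, instr_vec, expert_gates, selected):
    """Stage 2: per-token weights, restricted to the selected experts."""
    scale = math.sqrt(len(instr_vec))
    rows = []
    for tok in visual_tokens:
        # Assumed cross-modal score: a scaled dot product between the local
        # visual token and the global instruction vector, modulated per expert
        # by a hypothetical gating vector.
        fused = [t * g for t, g in zip(tok, instr_vec)]
        logits = [dot(fused, expert_gates[e]) / scale for e in selected]
        rows.append(softmax(logits))
    return rows


# Toy 4-dimensional embeddings for 4 experts (all values are made up).
instr_vec = [1.0, 0.0, 1.0, 0.0]
expert_keys = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 0], [0, 0, -1, 0]]
expert_gates = [[0.5, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0],
                [1.0, 0.0, -0.5, 0.0], [0.2, 0.2, 0.2, 0.2]]
visual_tokens = [[0.5, 0.1, 0.2, 0.0], [0.0, 0.3, 0.9, 0.1]]

selected = select_experts(instr_vec, expert_keys, k=2)
weights = token_weights(visual_tokens, instr_vec, expert_gates, selected)
print("selected experts:", selected)
for row in weights:
    print([round(w, 3) for w in row])
```

Because the second stage normalizes only over the experts chosen in the first stage, unselected experts receive no weight for any token. This is how the sparse selection limits cross-task interference in the sketch.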

To maintain stability across the non-stationary stream, a routing-stability regularization aligns current routing distributions with a historical exponential moving average (EMA) reference. Extensive experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing methods, effectively improving the model's abilities as it learns from continuously evolving data streams.
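
One way to picture the routing-stability term is as a divergence between the current routing distribution and an EMA of past distributions. The sketch below assumes a KL divergence in the direction KL(current || reference) and a decay of 0.9. The abstract specifies neither the divergence nor the decay, so both are illustrative choices.

```python
import math


def ema_update(reference, current, decay=0.9):
    """Blend the current routing distribution into the historical reference."""
    mixed = [decay * r + (1.0 - decay) * c for r, c in zip(reference, current)]
    total = sum(mixed)
    return [m / total for m in mixed]  # renormalize to guard against drift


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def routing_stability_loss(current, reference):
    # Assumption: the reference acts as a fixed target; in a real training
    # loop no gradient would flow through it.
    return kl_divergence(current, reference)


reference = [0.25, 0.25, 0.25, 0.25]  # EMA of past routing (toy values)
current = [0.70, 0.10, 0.10, 0.10]    # routing on the newest chunk (toy values)

loss = routing_stability_loss(current, reference)
self_loss = routing_stability_loss(reference, reference)
new_reference = ema_update(reference, current)
print("loss:", round(loss, 4))
print("updated reference:", [round(r, 3) for r in new_reference])
```

The penalty grows as the current routing departs from its history, while the slowly moving reference still lets routing adapt when the task mixture genuinely shifts.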
