Decompose, Compare, and Decide: Multimodal LLMs are Implicit Few-Shot Learners

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable abilities when analyzing images, yet translating these capabilities to few-shot image classification remains challenging. To bridge this gap, we present DeCoDe, a simple yet effective technique that enables off-the-shelf MLLMs to act as strong few-shot classifiers without any additional training.

Our approach treats few-shot classification as a set of pairwise image comparisons, decomposing the task into binary decisions. Given a query image and a support image from a candidate class, the MLLM is prompted to decide whether the two images depict the same class. The logit of the affirmative response then serves as a similarity score, and the query image is assigned to the most likely class.
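
The pairwise decision rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: `yes_logit` is a hypothetical stand-in for a call that prompts an MLLM with a query/support pair and returns the logit of the affirmative answer, and averaging the scores over several support images per class is an assumption, since the abstract does not say how multiple shots are combined.

```python
from statistics import mean
from typing import Any, Callable, Dict, Sequence, Tuple


def classify_query(
    query: Any,
    support_set: Dict[str, Sequence[Any]],
    yes_logit: Callable[[Any, Any], float],
    aggregate: Callable[[Sequence[float]], float] = mean,
) -> Tuple[str, Dict[str, float]]:
    """Assign `query` to the class whose support images score highest.

    `yes_logit(query, support)` is a hypothetical callable returning the
    logit of an affirmative ("same class") answer for one image pair.
    `aggregate` combines the per-image scores of a class (assumed: mean).
    """
    if not support_set:
        raise ValueError("support_set must contain at least one class")

    # One binary comparison per (query, support image) pair.
    class_scores = {
        label: aggregate([yes_logit(query, s) for s in images])
        for label, images in support_set.items()
    }
    # The class with the highest aggregated score is the prediction.
    best_label = max(class_scores, key=class_scores.get)
    return best_label, class_scores


def toy_yes_logit(query: float, support: float) -> float:
    """Toy stand-in for the MLLM: closer numbers get a higher score."""
    return -abs(query - support)


if __name__ == "__main__":
    support = {"cat": [1.0, 2.0], "dog": [8.0, 9.0]}
    label, scores = classify_query(1.5, support, toy_yes_logit)
    print(label, scores)
```

In a real setting, `toy_yes_logit` would be replaced by a function that sends both images and the comparison prompt to the model and reads the logit of the affirmative token; the surrounding loop stays the same.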

While this already yields good results, we show that giving the model additional high-level information, such as the data domain, further improves performance. We extensively analyze various inference variants on a suite of twelve datasets: six established few-shot benchmarks and six newly curated ones, spanning diverse domains.
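
As a rough illustration of how high-level information might be added to the comparison prompt, the sketch below prepends an optional domain hint to the question. The prompt wording and the function name `build_comparison_prompt` are hypothetical; the abstract does not give the actual prompts.

```python
from typing import Optional

# Hypothetical question text; the actual prompt wording is not given in the abstract.
BASE_QUESTION = "Do the two images show the same class? Answer Yes or No."


def build_comparison_prompt(domain: Optional[str] = None) -> str:
    """Return the pairwise comparison prompt, optionally with a domain hint."""
    if domain:
        # Prepend the high-level context before the binary question.
        return f"The images come from the following domain: {domain}. {BASE_QUESTION}"
    return BASE_QUESTION


if __name__ == "__main__":
    print(build_comparison_prompt())
    print(build_comparison_prompt("satellite imagery"))
```

Keeping the hint optional makes it easy to compare the plain variant against the domain-informed one on the same image pairs.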

The results show that the proposed simple decomposition technique can turn off-the-shelf MLLMs into powerful few-shot learners, significantly outperforming current state-of-the-art few-shot methods on both standard and novel domains. Code is available at this https URL.
