Interactive Episodic Memory with User Feedback

In episodic memory with natural language queries (EM-NLQ), a user asks a question (e.g., “Where did I place the mug?”) whose answer requires searching a long egocentric video, captured from the user’s perspective, to find the moment that answers it.

However, queries can be ambiguous or incomplete, leading to incorrect responses. Current methods ignore this key aspect and address EM-NLQ in a one-shot setup, which limits their applicability in real-world scenarios.

In this work, we address this gap and introduce the Episodic Memory with Questions and Feedback task (EM-QnF). Here, the user can provide feedback on the model’s initial prediction or add more information (e.g., “Before this. I’m looking for the big blue mug not the white one”), helping the model refine its predictions interactively.
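
As a rough illustration of the interaction described above, the following toy sketch shows an initial prediction being revised after a feedback message such as “Before this.” It is not the method proposed in this work: the `Window` type, the `refine` function, the candidate scores, and the keyword rule are all hypothetical stand-ins, and a real system would interpret the feedback with a learned model rather than with string matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """A candidate answer moment in the video (all fields are illustrative)."""
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
    score: float  # hypothetical confidence from some EM-NLQ model


def refine(candidates, previous, feedback):
    """Pick a new window using a temporal hint in the feedback (toy keyword rule)."""
    text = feedback.lower()
    if "before" in text:
        # Keep only windows that end before the previous prediction starts.
        pool = [w for w in candidates if w.end <= previous.start]
    elif "after" in text:
        # Keep only windows that start after the previous prediction ends.
        pool = [w for w in candidates if w.start >= previous.end]
    else:
        # No temporal hint: just avoid repeating the rejected prediction.
        pool = [w for w in candidates if w != previous]
    # Fall back to the previous prediction if no candidate survives the filter.
    return max(pool, key=lambda w: w.score) if pool else previous


# Made-up candidates for the query "Where did I place the mug?"
candidates = [
    Window(12.0, 15.0, 0.61),
    Window(48.0, 52.0, 0.83),
    Window(90.0, 94.0, 0.40),
]

# Round 1: one-shot prediction, i.e. the highest-scoring window.
initial = max(candidates, key=lambda w: w.score)

# Round 2: the user rejects it and adds a temporal hint.
refined = refine(
    candidates,
    initial,
    "Before this. I'm looking for the big blue mug not the white one",
)
print(initial, refined)
```

In this sketch only the temporal part of the feedback (“Before this”) is used; the descriptive part (“the big blue mug not the white one”) is ignored, which is exactly the kind of information a learned feedback module would need to exploit.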

To this end, we collect datasets for feedback-based interaction and propose a lightweight training scheme that avoids expensive sequential optimization. We also introduce a plug-and-play Feedback ALignment Module (FALM) that enables existing EM-NLQ models to incorporate user feedback effectively.

Our approach significantly improves over the state of the art on three challenging benchmarks and is better than or competitive with commercial large vision-language models while remaining efficient. Evaluation with human-generated feedback shows that it generalizes well to real-world scenarios.