Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks.

To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper.

We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions.

We also release FigTalk, a benchmark with new sequential and component-level grounding metrics.

On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned spatial grounding of figures in both automatic and human evaluation.