Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures
Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks.
To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper.
We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions.
We also release FigTalk, a benchmark with new sequential and component-level grounding metrics.
On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned spatial grounding of figures in both automatic and human evaluation.
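
To make the notion of a narrated, region-grounded walkthrough concrete, the following is a minimal sketch of one way such an output could be represented and scored. It is an illustration only: the names `Box`, `Step`, `iou`, and `mean_stepwise_iou` are hypothetical, and the position-wise intersection-over-union score is a generic stand-in, not the sequential or component-level metric defined by FigTalk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned region of a figure, in pixel coordinates (assumed format)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        # Degenerate or inverted boxes count as empty.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


@dataclass(frozen=True)
class Step:
    """One walkthrough step: a narration sentence and the region it highlights."""
    narration: str
    region: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when both are empty."""
    inter = Box(max(a.x0, b.x0), max(a.y0, b.y0),
                min(a.x1, b.x1), min(a.y1, b.y1)).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def mean_stepwise_iou(pred: list[Step], ref: list[Step]) -> float:
    """Average IoU over steps paired by position.

    Steps without a counterpart (when the lists differ in length) score 0,
    so a walkthrough that skips or adds steps is penalised.
    This is an illustrative order-sensitive score, not FigTalk's metric.
    """
    n = max(len(pred), len(ref))
    if n == 0:
        return 0.0
    return sum(iou(p.region, r.region) for p, r in zip(pred, ref)) / n
```

Pairing steps by position makes the score sensitive to narration order, which is the property a sequential grounding metric needs; a component-level variant would instead match each named component to its best-overlapping region regardless of order.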