Subjective Portrait Region Cropping in Landscape Videos with Temporal Annotation Smoothing

Abstract: With the rise of mobile video consumption across diverse handheld display resolutions and orientation modes, adapting videos to different aspect ratios poses challenges. Static cropping and border padding often compromise visual quality, while warping may distort a video’s intended meaning. Here we advocate for a more effective approach: cropping significant regions from video frames over time, while minimizing distortion and preserving essential content.
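
As a rough illustration of the cropping geometry involved, the sketch below computes a full-height portrait window inside a landscape frame and clamps it to the frame borders. The 9:16 target ratio, the function name `portrait_crop_window`, and the use of a single horizontal center coordinate are assumptions made for this example; they are not taken from the database or its annotation protocol.

```python
def portrait_crop_window(frame_w, frame_h, center_x, aspect_w=9, aspect_h=16):
    """Return (left, top, width, height) of a full-height portrait crop.

    Hypothetical helper: the 9:16 default is an assumed target ratio.
    All arguments are integers measured in pixels.
    """
    # Width of a window that spans the full frame height at the target ratio.
    crop_w = min(frame_w, frame_h * aspect_w // aspect_h)
    # Center the window on center_x, then clamp it so it stays inside the frame.
    left = center_x - crop_w // 2
    left = max(0, min(left, frame_w - crop_w))
    return left, 0, crop_w, frame_h


if __name__ == "__main__":
    # A 1920x1080 frame with the region of interest at the horizontal midpoint.
    print(portrait_crop_window(1920, 1080, 960))
```

Only the horizontal position varies in this sketch, since a portrait window cut from a landscape frame can keep the full frame height.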

One barrier to solving this problem is the lack of a sufficiently large-scale database dedicated to these tasks. To fill this gap, we introduce the LIVE-YouTube Video Cropping (LIVE-YT VC) database, featuring 1800 videos annotated by 90 human subjects. Built from videos sourced from the YouTube-UGC and LSVQ databases, this new resource is the largest publicly available subjective video portrait region cropping database.

We also introduce a post-processed version of the database, called LIVE-YT VC++, in which a novel intra-frame temporal filter is deployed to smooth the subjective annotations within each video. We demonstrate the usefulness of this new data resource using the SmartVidCrop algorithm and state-of-the-art video grounding models, in the hope of establishing our subjective dataset as a benchmark for future research.
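
To make the idea of temporally smoothing annotations concrete, the following minimal sketch applies a centered moving average to a per-frame sequence of crop-window centers. This is a generic stand-in and not the filter used to build LIVE-YT VC++, whose details are not given here; the function name `smooth_crop_centers` and the `radius` parameter are assumptions introduced for illustration.

```python
def smooth_crop_centers(centers, radius=2):
    """Smooth per-frame crop centers with a centered moving average.

    Hypothetical example, not the authors' filter. Near the start and end
    of the sequence the window shrinks to the frames that actually exist.
    """
    smoothed = []
    n = len(centers)
    for i in range(n):
        lo = max(0, i - radius)          # first frame in the window
        hi = min(n, i + radius + 1)      # one past the last frame in the window
        window = centers[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


if __name__ == "__main__":
    # A single-frame jump in the annotated center is spread over its neighbours.
    print(smooth_crop_centers([100, 100, 400, 100, 100], radius=1))
```

A larger `radius` suppresses annotation jitter more strongly but also delays the response to genuine subject motion, so any real filter has to balance the two.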

Our contributions offer a resource for advancing video aspect ratio transformation models, helping to ensure that reshaped, mobile-friendly video content retains its quality and meaning. Since our labels resemble video saliency annotations, we also analyzed the similarity between our labels and video saliency predictions. Finally, we repurposed state-of-the-art video grounding models for aspect ratio change tasks and fine-tuned them on our dataset. As a service to the research community, we plan to open-source the project.
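
One simple way to relate a crop label to a saliency prediction is to measure how much of the predicted saliency falls inside the labeled window. The sketch below does this for a full-height crop; it is only one possible measure and is not claimed to be the analysis performed in this work. The function name `saliency_mass_in_crop` and the list-of-rows map representation are assumptions for the example.

```python
def saliency_mass_in_crop(saliency, left, width):
    """Fraction of total saliency that lies inside a full-height crop.

    Hypothetical measure for illustration. `saliency` is a list of rows of
    non-negative numbers; the crop covers columns left .. left + width - 1.
    """
    total = sum(sum(row) for row in saliency)
    if total == 0:
        return 0.0  # an empty map carries no information about the crop
    inside = sum(sum(row[left:left + width]) for row in saliency)
    return inside / total


if __name__ == "__main__":
    toy_map = [[0, 1, 3, 0],
               [0, 2, 2, 0]]
    print(saliency_mass_in_crop(toy_map, left=1, width=2))
    print(saliency_mass_in_crop(toy_map, left=2, width=2))
```

A value near 1 means the labeled window captures almost all of the predicted saliency, while a low value indicates that annotators and the saliency model attended to different regions.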
