VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale.
At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32-256 tokens, achieving a state-of-the-art efficiency and performance trade-off.
Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources.
Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID.
In contrast to fixed-resolution AR models such as LlamaGen, whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024x1024), VibeToken-Gen maintains a constant 179G FLOPs (63.4x more efficient) independent of resolution.
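The scaling argument above can be illustrated with a small back-of-the-envelope sketch. It is not the authors' FLOPs accounting: the downsampling factor of 16 for a 2D grid tokenizer is an assumption made here for illustration, and the helper names `grid_token_count` and `efficiency_ratio` are hypothetical. Only the 64-token budget and the two FLOPs figures are taken from the text.

```python
# Back-of-the-envelope sketch; constants are quoted from the text or labeled assumptions.
ASSUMED_DOWNSAMPLE = 16  # assumption: spatial downsampling factor of a 2D grid tokenizer
FIXED_1D_TOKENS = 64     # from the text: tokens used to synthesize 1024x1024 images


def grid_token_count(height: int, width: int, downsample: int = ASSUMED_DOWNSAMPLE) -> int:
    """Tokens emitted by a hypothetical 2D grid tokenizer (one token per patch)."""
    return (height // downsample) * (width // downsample)


def efficiency_ratio(baseline_flops: float, model_flops: float) -> float:
    """How many times more FLOPs the baseline spends than the model."""
    return baseline_flops / model_flops


# Grid token count quadruples each time the side length doubles,
# while the 1D token budget stays fixed.
grid_tokens = {side: grid_token_count(side, side) for side in (256, 512, 1024)}

# Rounded figures quoted in the text: 11T FLOPs vs. 179G FLOPs.
ratio = efficiency_ratio(11e12, 179e9)

for side, n in grid_tokens.items():
    print(f"{side}x{side}: grid tokens = {n}, 1D tokens = {FIXED_1D_TOKENS}")
print(f"FLOPs ratio from rounded figures: {ratio:.1f}x")
```

With the rounded figures, the ratio comes out slightly below the reported 63.4x; the reported value presumably derives from unrounded FLOP counts.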
We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.