VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations


We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale.

At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32 to 256 tokens, achieving a state-of-the-art trade-off between efficiency and performance.

Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator that supports arbitrary resolutions out of the box while requiring significantly less compute.

Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID.

In contrast to fixed-resolution AR models such as LlamaGen, whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024x1024), VibeToken-Gen maintains a constant 179G FLOPs (63.4x more efficient) independent of resolution.
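As a rough illustration of why a fixed token budget decouples cost from resolution, the sketch below compares the sequence length of a grid-based 2D tokenizer with a constant 1D budget of 64 tokens. The function name `grid_tokens` and the 16x downsampling factor are assumptions made for illustration only; they are not taken from this work or from LlamaGen's actual configuration. The final lines simply divide the two rounded FLOPs figures quoted above.

```python
def grid_tokens(height: int, width: int, patch: int = 16) -> int:
    """Token count for a hypothetical fixed-grid 2D tokenizer.

    Assumes one token per (patch x patch) pixel block; patch=16 is an
    illustrative assumption, not a value stated in the abstract.
    """
    return (height // patch) * (width // patch)


# Token budget quoted in the abstract for 1024x1024 synthesis.
FIXED_1D_TOKENS = 64

for side in (256, 512, 1024):
    n_grid = grid_tokens(side, side)
    # The grid sequence grows with the square of the side length,
    # while the 1D budget stays constant.
    print(f"{side}x{side}: grid={n_grid} tokens, 1D={FIXED_1D_TOKENS} tokens, "
          f"ratio={n_grid / FIXED_1D_TOKENS:.0f}x")

# Ratio of the rounded FLOPs figures quoted in the abstract.
flops_ratio = 11e12 / 179e9
print(f"FLOPs ratio from rounded figures: {flops_ratio:.1f}x")
```

With the rounded figures the ratio comes out slightly below the reported 63.4x, which presumably reflects unrounded FLOPs counts.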

We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.