MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second
June 8, 2026
MiMo-V2.5-Pro-UltraSpeed: Pushing 1T-Parameter Model Generation Speed to 1000 TPS
1. Xiaomi MiMo-V2.5-Pro-UltraSpeed: Speed is the Ultimate Edge
From the first roaring race cars of the combustion age to the sonic boom of the first supersonic flight, humanity’s hunger for speed is written into our DNA. The speed of AI inference is no different: it defines the boundaries of intelligence itself. When a model is fast enough, it ceases to be a tool you wait on and becomes an extension of your own thinking: responding in real time, iterating in an instant, collaborating without friction.
Today, in collaboration with TileRT, we are thrilled to release Xiaomi MiMo-V2.5-Pro-UltraSpeed, which breaks the 1000 tokens/s decode-speed barrier on a 1-trillion-parameter model for the first time!
MiMo-V2.5-Pro UltraSpeed real-time generation speed comparison (up to ~1200 tokens/s)
2. Limited-Time Access · Application-Based
The MiMo-V2.5-Pro-UltraSpeed API launches simultaneously at a limited-time promotional price: 3× the cost of MiMo-V2.5-Pro for approximately 10× the generation speed. (API only; Token Plan not supported.)
Due to limited high-speed inference resources, MiMo-V2.5-Pro-UltraSpeed is available by application only, for a limited time. Approved users can access the API during the trial period, which runs from June 9 to June 23, 2026, 23:59 Beijing Time (UTC+8), that is, 08:59 PDT on June 23.
How to Apply
API platform: platform.xiaomimimo.com/ultraspeed. Trial slots are limited, and submitting an application does not guarantee approval. We will prioritize enterprises and professional developers with genuine business needs. For standard model access, please see the MiMo-V2.5 model series. For in-depth business partnerships around the UltraSpeed model, contact business-mimo@xiaomi.com.
Chat Experience (Free During Trial)
Approved users receive free Chat access for the duration of the two-week window. Entry point: ultraspeed.xiaomimimo.com
To ensure quality and fairness under resource constraints, the following rules apply: each account may enter the queue up to 10 times per day; each session is capped at 30 minutes; sessions idle for more than 5 minutes are released automatically.
3. 1000 tokens/s: Not Just Fast, But a Paradigm Shift
At the trillion-parameter (1T) scale, breaking 1000 tps is far more than a faster typewriter: it fundamentally disrupts AI application paradigms.
First, speed itself begins to translate into intelligence. Previously, when facing a hard problem, you could only “wait for one answer and pray it’s correct.” Now, within the same wall-clock time, the model can run dozens of reasoning paths in parallel (Best-of-N / Tree Search), automatically verifying and self-correcting in the background. Raw speed is converted into depth of thought, directly raising reasoning quality.
Second, it lifts the productivity ceiling of Coding Agents. Before, having AI write code meant developers waiting painfully in front of their screens, bottlenecked by inference latency. At 1000 tps, code generation speed and production efficiency undergo a paradigm-level acceleration.
Most importantly, trillion-parameter models can now enter real-time decision loops. Millisecond-level “think-respond” cycles allow 1T flagship models to plug seamlessly into time-critical scenarios: high-frequency quantitative trading signal generation, instant anti-fraud interception, intelligent bidding, and real-time interactive dialogue. And when this power is brought to surgical assistance and medical imaging analysis in life-or-death situations, AI speed is no longer just a metric of efficiency; it becomes a bargaining chip in the race against death. On the operating table, every second AI saves in completing lesion analysis and risk prediction gives the surgeon one more degree of freedom. This deepens our conviction that the ultimate significance of speed is not merely boosting productivity, but enabling technology to help humanity live better.
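The Best-of-N idea from the first point above can be sketched in a few lines. This is a minimal illustration, not the MiMo pipeline: `best_of_n`, `toy_generate`, and `toy_verify` are hypothetical names, and the toy generator stands in for a real model call so the example runs offline. The point it shows is that a fast model can spend the same wall-clock time sampling many candidates and keeping only one that passes verification.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

def best_of_n(generate: Callable[[str, int], str],
              verify: Callable[[str, str], float],
              prompt: str, n: int = 16) -> Optional[str]:
    """Sample n candidates concurrently and return the best verified one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda seed: generate(prompt, seed), range(n)))
    best = max(candidates, key=lambda c: verify(prompt, c))
    # Return nothing rather than an answer the verifier rejects.
    return best if verify(prompt, best) > 0.0 else None

def toy_generate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for a model call: a noisy multiplier."""
    a, b = (int(x) for x in prompt.split("*"))
    return str(a * b + (seed % 5) - 2)  # only seeds 2, 7, 12, ... are exact

def toy_verify(prompt: str, candidate: str) -> float:
    """Hypothetical verifier: score 1.0 for the exact product, else 0.0."""
    a, b = (int(x) for x in prompt.split("*"))
    return 1.0 if int(candidate) == a * b else 0.0

if __name__ == "__main__":
    print(best_of_n(toy_generate, toy_verify, "17*23", n=1))   # single shot
    print(best_of_n(toy_generate, toy_verify, "17*23", n=16))  # sixteen paths
```

With a single sample the toy generator happens to miss and the verifier rejects it; with sixteen samples at least one passes. In practice the verifier would be a test suite, a checker, or a reward model, which is an assumption beyond what the text specifies.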
4. Extreme Model-System Codesign
Achieving 1000+ tokens/s generation speed with a 1T flagship model is not the breakthrough of a single technique; it is the product of deep collaboration and extreme codesign between the MiMo model team and the TileRT system team. The industry’s current approach to similar extreme speeds typically relies on specialized hardware, such as Cerebras’s wafer-scale integration or Groq’s custom architecture built purely on on-chip SRAM. We chose a different path: achieving even more impressive inference speed on commodity GPUs through model-system codesign alone.
On the model side, we applied FP4 quantization targeting the bandwidth bottleneck of commodity hardware, dramatically shrinking model size and reducing memory-access overhead. At the same time, we introduced DFlash, an efficient speculative decoding method based on block-level masked parallel prediction, which substantially increases the number of tokens accepted per verification step. On the system side, TileRT adapts to the dynamic characteristics of these algorithms, delivering a tailor-made compilation engine and compute kernels optimized specifically for the novel quantization and speculative decoding pipeline. Through this extreme codesign, we achieved 1000+ tokens/s output from a 1T model using just a single standard 8-GPU commodity node.
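To make "accepted token length per verification step" concrete, here is a minimal sketch of generic greedy draft-and-verify speculative decoding. It is not DFlash: the drafting method, the function names, and the toy models are all assumptions made for illustration. A real system would score every draft position in one target forward pass; this sketch calls the target once per position only to keep the logic readable.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextFn = Callable[[Sequence[Token]], Token]             # target: next greedy token
DraftFn = Callable[[Sequence[Token], int], List[Token]]  # drafter: block of k guesses

def verify_block(target_next: NextFn, context: List[Token],
                 draft: Sequence[Token]) -> List[Token]:
    """Keep the longest draft prefix the target agrees with, plus one target token."""
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the first miss
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # bonus token if all matched
    return accepted

def speculative_decode(target_next: NextFn, draft_block: DraftFn,
                       prompt: Sequence[Token], max_new_tokens: int,
                       block_size: int = 4) -> Tuple[List[Token], int]:
    """Return (new tokens, number of verification steps)."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(verify_block(target_next, out, draft_block(out, block_size)))
        steps += 1
    return out[len(prompt):len(prompt) + max_new_tokens], steps

def tokens_per_second(accepted_per_step: float, step_seconds: float) -> float:
    """Decode throughput when each verification step takes step_seconds."""
    return accepted_per_step / step_seconds

def toy_target(ctx: Sequence[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 100."""
    return (ctx[-1] + 1) % 100

def toy_draft(ctx: Sequence[Token], k: int) -> List[Token]:
    """Hypothetical drafter: counts too, but skips every multiple of 7."""
    block, cur = [], ctx[-1]
    for _ in range(k):
        nxt = (cur + 1) % 100
        if nxt % 7 == 0:
            nxt = (nxt + 1) % 100
        block.append(nxt)
        cur = nxt
    return block

if __name__ == "__main__":
    tokens, steps = speculative_decode(toy_target, toy_draft, [0], 20)
    print(tokens, steps)
```

Under greedy verification the output matches what the target alone would produce, only in fewer target steps. The throughput helper makes the arithmetic explicit: as a purely illustrative figure, not a measurement from the article, accepting 8 tokens per step at 8 ms per step corresponds to roughly 1000 tokens/s.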
4.1 FP4 Quantization
At the trillion-parameter (1T) scale, traditional 8-bit (FP8 / INT8) or even 16-bit inference imposes a prohibitive memory footprint and bandwidth pressure. Because decoding is bound by how fast weights can be read from memory, reducing parameter bit-width directly increases decoding speed. We therefore adopt the widely validated, virtually lossless FP4 (MXFP4) quantization format[1].
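As a rough illustration of why bit-width matters, the sketch below quantizes one block in the MXFP4 layout (32 FP4 E2M1 elements sharing one 8-bit power-of-two scale) and computes weight storage at different precisions. It is a simplified reference in pure Python, not the kernels used here: ties round toward the smaller magnitude instead of round-to-nearest-even, the shared exponent is not clamped to the 8-bit scale range, and the footprint figure assumes every one of the 10^12 parameters is stored in this format.

```python
import math
from typing import List, Tuple

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 (E2M1) magnitudes
BLOCK_SIZE = 32   # elements that share one scale in MXFP4
E2M1_EMAX = 2     # largest E2M1 exponent: 6.0 = 1.5 * 2**2

def quantize_block(values: List[float]) -> Tuple[int, List[float]]:
    """Return (shared exponent, FP4 element values) for one block of floats."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    shared_exp = (math.frexp(amax)[1] - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elems = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - mag))
        elems.append(math.copysign(nearest, v))
    return shared_exp, elems

def dequantize_block(shared_exp: int, elems: List[float]) -> List[float]:
    """Reconstruct approximate floats from the shared exponent and FP4 elements."""
    return [e * 2.0 ** shared_exp for e in elems]

def bits_per_weight(block: int = BLOCK_SIZE, elem_bits: int = 4,
                    scale_bits: int = 8) -> float:
    """Effective bits per weight, including the amortized shared scale."""
    return elem_bits + scale_bits / block

def weight_bytes(n_params: int, bits: float) -> float:
    """Storage in bytes for n_params weights at the given bits per weight."""
    return n_params * bits / 8

if __name__ == "__main__":
    print(dequantize_block(*quantize_block([6.0, 3.0, -1.5, 0.2])))
    for name, bits in [("16-bit", 16), ("8-bit", 8), ("MXFP4", bits_per_weight())]:
        print(name, weight_bytes(10**12, bits) / 1e12, "TB")
```

For 10^12 parameters this gives 2 TB at 16 bits, 1 TB at 8 bits, and about 0.53 TB at MXFP4's 4.25 effective bits per weight, which is the reduction in bytes read per decoded token that the paragraph above refers to.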
However, naively applying FP4 across the entire model causes degradation in complex reasoning, logic, and code generation. Given the MoE (Mixture of Experts) architecture of Xiaomi MiMo-V2.5-Pro, where Experts constitute the vast majority of parameters and exhibit the h…