Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

This repository contains the official implementation and model checkpoints for Orthrus, a dual-view framework that unifies the exact generation fidelity of autoregressive LLMs with the high-speed parallel token generation of diffusion models.


Model Zoo

All models use a Qwen3 backbone and guarantee strictly lossless generation.

Model                 Base Model    HuggingFace       Avg. Speedup
Orthrus-Qwen3-1.7B    Qwen3-1.7B    🤗 HuggingFace    4.25×
Orthrus-Qwen3-4B      Qwen3-4B      🤗 HuggingFace    5.20×
Orthrus-Qwen3-8B      Qwen3-8B      🤗 HuggingFace    5.36×

Installation

We recommend uv for fast dependency resolution.

uv pip install -e .
uv pip install ninja packaging
uv pip install flash-attn --no-build-isolation
# or: pip install "flash-attn-4[cu13]" if your device supports it
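
To sanity-check the environment before loading a checkpoint, a short probe like the one below can help. This is our suggestion rather than part of the official setup; flash-attn is optional, and the model can fall back to another attention backend if it is missing.

import torch

# Orthrus checkpoints ship in bfloat16, so a CUDA device with bf16
# support is expected to see the advertised speedups.
assert torch.cuda.is_available(), "a CUDA device is required"
print("GPU:", torch.cuda.get_device_name(0))
print("bf16 supported:", torch.cuda.is_bf16_supported())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not found; choose a different attn_implementation when loading")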

Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model = AutoModelForCausalLM.from_pretrained(
    "chiennv/Orthrus-Qwen3-8B",
    dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2", # use "flash_attention_4" if your system supports it
    trust_remote_code=True,
).eval()

tokenizer = AutoTokenizer.from_pretrained("chiennv/Orthrus-Qwen3-8B")
prompt = "Write a program to count the frequency of each word in a paragraph."
messages = [{"role": "system", "content": ""}, {"role": "user", "content": prompt}]

# apply_chat_template returns the token IDs directly when return_tensors="pt"
input_ids = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True, enable_thinking=False
)

output_ids = model.generate(
    input_ids=input_ids.to(model.device),
    max_new_tokens=2048,
    use_diffusion_mode=True, # enable parallel dual-view (diffusion) decoding
    streamer=TextStreamer(tokenizer, skip_prompt=True) # enable streaming generation
)
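
The streamer already prints tokens as they are produced. If you also want the completion as a plain string, decode the newly generated portion; this assumes generate returns the prompt followed by the new tokens, as in standard transformers:

# Strip the prompt and decode only the generated tokens.
completion = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
print(completion)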

Coming soon: Native integration with vLLM and SGLang. Stay tuned!


Key Advantages

  • Significant Inference Acceleration: Breaks the sequential bottleneck of standard autoregressive decoding, delivering up to a $7.8\times$ speedup on generation tasks.
  • Strictly Lossless Generation: Employs an exact intra-model consensus mechanism to guarantee that the output matches the original base model's exact predictive distribution; see the verification sketch after this list.
  • Zero Redundant Memory Overhead: Both the autoregressive and diffusion views natively attend to the exact same high-fidelity key-value (KV) cache, resulting in only an $O(1)$ memory cache overhead.
  • Parameter Efficient: Parallel generation capabilities are injected by fine-tuning only 16% of the total model parameters while keeping the base LLM strictly frozen.
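
Losslessness is straightforward to spot-check: under greedy decoding, diffusion mode and plain autoregressive decoding should emit token-for-token identical sequences. Below is a minimal sketch reusing model and input_ids from the Quickstart; it assumes use_diffusion_mode=False falls back to standard autoregressive decoding.

import torch

with torch.inference_mode():
    # Greedy decoding (do_sample=False) makes both runs deterministic.
    ar_ids = model.generate(
        input_ids=input_ids.to(model.device),
        max_new_tokens=256,
        do_sample=False,
        use_diffusion_mode=False,  # plain autoregressive decoding
    )
    dv_ids = model.generate(
        input_ids=input_ids.to(model.device),
        max_new_tokens=256,
        do_sample=False,
        use_diffusion_mode=True,   # parallel dual-view decoding
    )

# Strictly lossless generation implies identical token sequences.
assert torch.equal(ar_ids, dv_ids), "outputs diverged"
print("lossless check passed over", ar_ids.shape[-1] - input_ids.shape[-1], "tokens")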

Performance Comparison: Orthrus vs. Speculative Decoding

Orthrus outperforms speculative-decoding methods such as EAGLE-3 and DFlash. By natively sharing the exact same KV cache across its dual views, Orthrus avoids the redundant memory overhead of a separate draft model, yielding significantly higher token acceptance rates and faster inference, especially as context length grows.
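
To reproduce a speedup figure on your own hardware, a crude wall-clock harness is enough. This is a sketch only (single prompt, greedy decoding, no warm-up or batching), reusing model and input_ids from the Quickstart:

import time
import torch

def tokens_per_second(use_diffusion_mode, max_new_tokens=512):
    # Time one generate() call and report decoded tokens per second.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(
        input_ids=input_ids.to(model.device),
        max_new_tokens=max_new_tokens,
        do_sample=False,
        use_diffusion_mode=use_diffusion_mode,
    )
    torch.cuda.synchronize()
    return (out.shape[-1] - input_ids.shape[-1]) / (time.perf_counter() - start)

ar = tokens_per_second(False)  # autoregressive baseline
dv = tokens_per_second(True)   # dual-view parallel decoding
print(f"AR: {ar:.1f} tok/s | Orthrus: {dv:.1f} tok/s | speedup: {dv / ar:.2f}x")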


Comparison with State-of-the-Art Diffusion Models

While recent diffusion language models (dLLMs) offer parallel decoding, they often suffer from significant conditional drift and severe accuracy degradation on complex reasoning tasks. Orthrus resolves this by decoupling parallel generation from sequential constraints, establishing a new state of the art for parallel generation fidelity.

Throughput vs. accuracy on MATH-500: Orthrus delivers a ~6× speedup over the Qwen3-8B baseline with strictly lossless performance, whereas adaptations such as Fast-dLLM-v2 suffer significant accuracy drops.


Citation

If you find this model or architecture useful in your work, please cite our paper:

@misc{vannguyen2026orthrusmemoryefficientparalleltoken,
      title={Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion}, 
      author={Chien Van Nguyen and Chaitra Hegde and Van Cuong Pham and Ryan A. Rossi and Franck Dernoncourt and Thien Huu Nguyen},
      year={2026},
      eprint={2605.12825},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.12825},
}