Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents
NVIDIA Nemotron 3 Nano Omni is a new omni-modal understanding model built for real-world document analysis, multi-image reasoning, automatic speech recognition, long audio-video understanding, agentic computer use, and general reasoning. It extends the Nemotron multimodal line from a strong vision-language system to a broader text + image + video + audio model.
Nemotron 3 Nano Omni delivers best-in-class accuracy on complex document-intelligence leaderboards such as MMLongBench-Doc and OCRBenchV2, while also leading video and audio leaderboards such as WorldSense and DailyOmni. It achieves top accuracy on VoiceBench for audio understanding and ranks as the most cost-efficient open video understanding model on MediaPerf.
Under the hood, it combines the Nemotron 3 hybrid Mamba-Transformer Mixture-of-Experts backbone with a C-RADIOv4-H vision encoder and Parakeet-TDT-0.6B-v2 audio encoder. The architecture is designed to preserve fine visual detail, add native audio understanding, and scale to very long multimodal contexts for dense images, documents, videos, and mixed-modality reasoning.
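The composition described above can be sketched as follows. This is an illustrative sketch only: the class, function, and parameter names are hypothetical, not NVIDIA's actual API, and the real model fuses modalities inside a trained network rather than through plain callables.

```python
# Hypothetical sketch of the described design: a vision encoder and an audio
# encoder each map their modality to embedding sequences, which are interleaved
# with text embeddings into one long context consumed by a single backbone.
from typing import Callable, Sequence


class OmniPipeline:
    def __init__(
        self,
        vision_encoder: Callable[[object], Sequence[float]],
        audio_encoder: Callable[[object], Sequence[float]],
        backbone: Callable[[Sequence[float]], str],
    ):
        self.vision_encoder = vision_encoder
        self.audio_encoder = audio_encoder
        self.backbone = backbone

    def generate(self, text_embeds, images=(), audios=()):
        # Build one long multimodal context: image tokens, then audio tokens,
        # then the text prompt (the ordering here is illustrative).
        context = []
        for image in images:
            context.extend(self.vision_encoder(image))
        for audio in audios:
            context.extend(self.audio_encoder(audio))
        context.extend(text_embeds)
        return self.backbone(context)


# Stub encoders stand in for C-RADIOv4-H and Parakeet-TDT-0.6B-v2.
pipe = OmniPipeline(
    vision_encoder=lambda image: [1.0, 1.0],   # 2 visual tokens per image
    audio_encoder=lambda audio: [2.0],          # 1 audio token per clip
    backbone=lambda ctx: f"{len(ctx)} tokens",
)
result = pipe.generate([0.5, 0.5], images=[object()], audios=[object()])
```

The point of the sketch is the shape of the data flow, not the internals: every modality ends up as tokens in a single shared context, which is why long-context scaling benefits documents, video, and audio alike.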
The training recipe uses staged multimodal alignment and context extension, followed by preference optimization and multimodal reinforcement learning. Nemotron 3 Nano Omni delivers up to 9x higher throughput and 2.9x the single-stream reasoning speed on multimodal use cases, compared to alternatives. Download the BF16, FP8, and NVFP4 checkpoints on Hugging Face.
What Nemotron 3 Nano Omni is designed for
At a high level, Nemotron 3 Nano Omni is aimed at five classes of workloads:
- Real-world document analysis: This is not only about OCR. The model is positioned for long, messy, high-value documents where understanding depends on layout, tables, figures, formulas, section structure, and cross-page references. Think contracts, technical papers, reports, manuals, multi-page forms, or compliance packets. The model can handle 100+ page documents.
- Automatic speech recognition: Nemotron 3 Nano Omni includes strong speech understanding capabilities that enable high-quality transcription across diverse audio conditions. It handles long-form audio with varying speakers, accents, and background noise. These capabilities can be integrated into broader workflows, allowing spoken content to be transcribed, analyzed, and combined with other modalities for tasks like summarization, question answering, and cross-modal reasoning.
- Long audio-video understanding: Many enterprise and developer workflows depend on mixed audio and visual evidence: screen recordings with narration, training videos, meetings with slides, tutorials, product demos, customer support captures, and long-form video archives. Nemotron 3 Nano Omni is built to reason over those inputs jointly.
- Agentic computer use: The model is specifically trained for agentic computer use, enabling it to assist with tasks in graphical user interface (GUI) environments. Its capabilities include interpreting screenshots, monitoring the state of the user interface, grounding its reasoning in on-screen visuals, and helping with action selection or workflow automation.
- General multimodal reasoning: The model is designed for more than perception. It excels at reasoning-intensive tasks that require synthesizing information across long context windows, multiple modalities, and structured or semi-structured evidence. It can carry out multi-step reasoning, perform calculations, and connect signals from text, images, tables, and other inputs to arrive at coherent, well-supported answers.
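In practice, all five workloads reduce to prompts that mix modalities in a single request. A minimal sketch of assembling such a request, using the content-parts convention common to multimodal chat APIs; the field names and part layout here are assumptions for illustration, not the confirmed schema for this model:

```python
def build_omni_message(text, image_paths=(), audio_paths=(), video_paths=()):
    """Assemble one user turn mixing text, image, audio, and video parts.

    The {"type": ..., ...} layout mirrors common multimodal chat conventions;
    the exact request schema for Nemotron 3 Nano Omni may differ.
    """
    content = []
    for path in image_paths:
        content.append({"type": "image", "image": path})
    for path in audio_paths:
        content.append({"type": "audio", "audio": path})
    for path in video_paths:
        content.append({"type": "video", "video": path})
    # Put the instruction last so it follows the evidence it refers to.
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


# Example: one document page plus a narration track, with a single question.
message = build_omni_message(
    "Summarize the contract clause being read aloud.",
    image_paths=["page_12.png"],
    audio_paths=["narration.wav"],
)
```

A document-analysis request would pass many image parts (one per page), an ASR request a single audio part, and a long-video request video plus audio, all through the same structure.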