Gemma 4 12B: A unified, encoder-free multimodal model

Gemma 4 12B is designed to bring high-performance multimodal intelligence directly to your laptop, combining mobile-first efficiency with advanced reasoning.

Today, we are introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B and our more advanced 26B Mixture of Experts (MoE), Gemma 4 12B packages powerful capabilities inside a reduced memory footprint. It is also our first mid-sized model to feature native audio inputs.

Thanks to the developer community, Gemma 4 models have now crossed 150 million downloads. You’ve built everything from wearable robotic arms for physical assistance to enterprise-grade AI security. We’re excited to see what you build with this latest addition.

Here’s an overview of what makes Gemma 4 12B unique:

  • Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.
  • Advanced reasoning: Benchmark performance approaching that of our 26B model, unlocking powerful multi-step reasoning and agentic workflows.
  • Laptop ready: Small enough to run locally with just 16GB of VRAM or unified memory.
  • Open and accessible: Released under an Apache 2.0 license with support across the developer ecosystem.
  • Drafter-ready: Gemma 4 12B comes equipped with Multi-Token Prediction (MTP) drafters to reduce latency.
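
This post does not describe how the MTP drafters are wired into inference. One common way a drafter reduces latency is speculative decoding: a cheap model proposes a few tokens, and the large model checks them in a single pass. The toy sketch below illustrates that general idea only. `target_next`, `draft_next`, and the block size `k` are hypothetical stand-ins, not Gemma APIs.

```python
def target_next(context):
    """Stand-in for the large model: next token is (sum of context) mod 10."""
    return sum(context) % 10


def draft_next(context):
    """Stand-in for a cheap drafter: agrees with the target except when the
    context length is a multiple of 4 (to simulate occasional wrong guesses)."""
    guess = sum(context) % 10
    return (guess + 1) % 10 if len(context) % 4 == 0 else guess


def greedy_generate(prompt, num_new):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_generate(prompt, num_new, k=3):
    """Greedy speculative decoding with a k-token draft per verification pass."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_new:
        # 1) The drafter proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) The target checks every drafted position. A real system does this
        #    in one batched forward pass, so it is counted as one pass here.
        target_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the same pass yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + num_new], target_passes


prompt = [1, 2, 3]
output, passes = speculative_generate(prompt, num_new=12)
print(output == greedy_generate(prompt, 12), passes)
```

The output is identical to plain greedy decoding by construction. The saving comes from needing fewer target-model passes than generated tokens whenever the drafter guesses well.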

Together, these features bring advanced multimodal capabilities to everyday hardware without sacrificing speed or reasoning. Let’s now take a closer look at how Gemma 4 12B achieves this.

Run state-of-the-art agents locally

Gemma 4 12B delivers performance nearing our larger 26B MoE model on standard benchmarks, but at less than half the total memory footprint. Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
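
As a rough back-of-the-envelope check on these memory claims, the sketch below estimates weight storage at a few precisions. It assumes exactly 12 billion and 26 billion parameters and counts weights only, with no KV cache, activations, or runtime overhead. The precisions listed are illustrative assumptions, not a statement about the formats of the released checkpoints.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_12B = 12e9  # assumed parameter count for the 12B model
PARAMS_26B = 26e9  # assumed total parameter count for the 26B MoE model

for bits in (16, 8, 4):
    small = weight_memory_gb(PARAMS_12B, bits)
    large = weight_memory_gb(PARAMS_26B, bits)
    print(f"{bits:>2}-bit: 12B ~ {small:.1f} GB, 26B ~ {large:.1f} GB, "
          f"ratio {small / large:.2f}")
```

Under these assumptions the ratio is about 0.46 at every precision, consistent with "less than half". The 16-bit weights alone would come to roughly 24 GB, so fitting in 16GB would presumably rely on a lower-precision format.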

Experience a uniquely efficient, unified architecture

What makes Gemma 4 12B stand out is its streamlined approach to processing visual and audio inputs. Traditional multimodal models typically rely on separate encoders to translate images and audio before passing those representations to the language model. Because these separate encoders add latency and increase memory usage, we trained Gemma 4 12B with an encoder-free architecture that integrates audio and vision inputs directly.

Here is how Gemma 4 12B processes multimodal inputs natively:

  • Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations. This allows the LLM backbone to take over visual processing.
  • Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.
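
The two bullets above can be illustrated with a minimal, dependency-free sketch. It is not the actual Gemma implementation. The sizes (`D_MODEL`, `PATCH`, `FRAME`), the choice of RMS normalization, where the normalizations are applied, and the fixed-size audio framing are all assumptions made for illustration. The weights are random rather than trained.

```python
import math
import random

D_MODEL = 8  # hypothetical hidden size; real models are far larger
PATCH = 2    # hypothetical patch edge length in pixels
FRAME = 4    # hypothetical number of raw audio samples per frame


def rms_norm(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (no learned gain)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def matmul(rows, weight):
    """Multiply an (n x k) list of rows by a (k x d) weight matrix."""
    cols = list(zip(*weight))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in rows]


def random_matrix(n_rows, n_cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(n_cols)] for _ in range(n_rows)]


def patchify(image, patch):
    """Split an H x W grayscale image into flattened patch x patch tiles."""
    h, w = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(patch) for j in range(patch)]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]


def embed_image(image, w_proj, pos_emb):
    patches = [rms_norm(p) for p in patchify(image, PATCH)]  # normalize raw patches
    tokens = matmul(patches, w_proj)                         # the single matmul
    tokens = [[t + p for t, p in zip(tok, pos)]              # add positional embedding
              for tok, pos in zip(tokens, pos_emb)]
    return [rms_norm(tok) for tok in tokens]                 # normalize the outputs


def embed_audio(samples, w_proj):
    """Cut the raw waveform into frames and project each frame to D_MODEL."""
    frames = [samples[i:i + FRAME] for i in range(0, len(samples) - FRAME + 1, FRAME)]
    return matmul(frames, w_proj)


rng = random.Random(0)
image = [[rng.random() for _ in range(4)] for _ in range(4)]  # 4x4 toy image
samples = [math.sin(0.3 * n) for n in range(16)]              # toy waveform

image_tokens = embed_image(image,
                           random_matrix(PATCH * PATCH, D_MODEL, rng),
                           random_matrix(4, D_MODEL, rng))
audio_tokens = embed_audio(samples, random_matrix(FRAME, D_MODEL, rng))
text_tokens = random_matrix(3, D_MODEL, rng)  # stand-in for text token embeddings

# All three modalities now share one width and can form a single input sequence.
sequence = text_tokens + image_tokens + audio_tokens
print(len(sequence), len(sequence[0]))  # prints "11 8"
```

The point of the sketch is the shape contract: after these lightweight projections, image patches and audio frames are vectors of the same width as text embeddings, so one backbone can process them as a single sequence.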

For developers who want a detailed breakdown, head over to our companion Gemma 4 12B Developer Guide.

Get started today

  • Try it yourself: Experiment with a couple of clicks in LM Studio, Ollama, the Google AI Edge Gallery App, the Google AI Edge Eloquent app, or the LiteRT-LM CLI.
  • Download the weights: Download the pre-trained and instruction-tuned checkpoints directly from Hugging Face and Kaggle.
  • Integrate & learn: Review the developer documentation and the quick start notebook.
  • Use your favorite development tools: Implement local inference pipelines with Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM, or fine-tune efficiently with Unsloth.
  • Unlock Agentic Development with Gemma Skills: To help agents build with the latest Gemma advancements, we are releasing our official Skills Repository, a library of skills designed specifically to enable agents to build with Gemma models.
  • Deploy your way: Spin up production endpoints on Google Cloud through Gemini Enterprise Agent Platform Model Garden, Cloud Run, or GKE.