CUDA Proves Nvidia Is a Software Company

Forgive me for starting with a cliché, a piece of finance jargon that has recently slipped into the tech lexicon, but I’m afraid I must talk about “moats.” Popularized decades ago by Warren Buffett to refer to a company’s competitive advantage, the word found its way into Silicon Valley pitch decks when a memo purportedly leaked from Google, titled “We Have No Moat, and Neither Does OpenAI,” fretted that open-source AI would pillage Big Tech’s castle.

A few years on, the castle walls remain safe. Apart from a brief bout of panic when DeepSeek first appeared, open-source AI models have not vastly outperformed proprietary models. Still, none of the frontier labs—OpenAI, Anthropic, Google—has a moat to speak of.

The company that does have a moat is Nvidia. CEO Jensen Huang has called it his most precious “treasure.” It is not, as you might assume for a chip company, a piece of hardware. It’s something called CUDA. What sounds like a chemical compound banned by the FDA may be the one true moat in AI.

CUDA technically stands for Compute Unified Device Architecture, but much like laser or scuba, no one bothers to expand the acronym; we just say “KOO-duh.” So what is this all-important treasure good for? If forced to give a one-word answer: parallelization.

Here’s a simple example. Let’s say we task a machine with filling out a 9×9 multiplication table. Using a computer with a single core, all 81 operations are executed dutifully one by one. But a GPU with nine cores can assign tasks so that each core takes a different column—one from 1×1 to 1×9, another from 2×1 to 2×9, and so on—for a ninefold speed gain. Modern GPUs can be even cleverer. For example, if programmed to recognize commutativity—7×9 = 9×7—they can avoid duplicate work, reducing 81 operations to 45, nearly halving the workload. When a single training run costs a hundred million dollars, every optimization counts.
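The column-per-core scheme and the commutativity trick can be sketched in plain Python — a toy model, with a thread pool standing in for the GPU's nine cores:

```python
from concurrent.futures import ThreadPoolExecutor

def column(i):
    """Compute one column of the 9x9 table: i*1 through i*9."""
    return [(i, j, i * j) for j in range(1, 10)]

# Nine workers, one column each -- each "core" handles a different column.
with ThreadPoolExecutor(max_workers=9) as pool:
    table = [entry for col in pool.map(column, range(1, 10)) for entry in col]

assert len(table) == 81  # the full table, all 81 products

# Exploiting commutativity (7*9 == 9*7): compute i*j only for i <= j,
# and mirror the result for the other half of the table.
unique = {(i, j): i * j for i in range(1, 10) for j in range(i, 10)}
assert len(unique) == 45  # 81 operations reduced to 45
```

The thread pool is illustrative only; on a real GPU the scheduling is done in hardware across thousands of cores, and the commutativity shortcut would be baked into the kernel itself.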

Nvidia’s GPUs were originally built to render graphics for video games. In the early 2000s, a Stanford PhD student named Ian Buck, who first got into GPUs as a gamer, realized their architecture could be repurposed for general high-performance computing. He created a programming language called Brook, was hired by Nvidia, and, with John Nickolls, led the development of CUDA. If AI ushers in the age of a permanent white-collar underclass and autonomous weapons, just know that it would all be because someone somewhere playing Doom thought a demon’s scrotum should jiggle at 60 frames per second.

CUDA is not a programming language in itself but a “platform.” I use that weasel word because, not unlike how The New York Times is a newspaper that’s also a gaming company, CUDA has, over the years, become a nested bundle of software libraries for AI. Each function shaves nanoseconds off single mathematical operations—added up, they make GPUs, in industry parlance, go brrr.

A modern graphics card is not just a circuit board crammed with chips and memory and fans. It’s an elaborate confection of cache hierarchies and specialized units called “tensor cores” and “streaming multiprocessors.” In that sense, what chip companies sell is like a professional kitchen, and more cores are akin to more grilling stations. But even a kitchen with 30 grilling stations won’t run any faster without a capable head chef deftly assigning tasks—as CUDA does for GPU cores.

To extend the metaphor, hand-tuned CUDA libraries optimized for one matrix operation are the equivalent of kitchen tools designed for a single job and nothing more—a cherry pitter, a shrimp deveiner—which are indulgences for home cooks but not if you have 10,000 shrimp guts to yank out. Which brings us back to DeepSeek. Its engineers went below this already deep layer of abstraction to work directly in PTX, a kind of assembly language for Nvidia GPUs. Let’s say the task is peeling garlic. An unoptimized GPU would go: “Peel the skin with your fingernails.” CUDA can instruct: “Smash the clove with the flat of a knife.” PTX lets you dictate every sub-instruction: “Lift the blade 2.35 inches above the cutting board, make it parallel to the clove’s equator, and strike downward with your palm at a force of 36.2 newtons.”

You can begin to see why CUDA is so valuable to Nvidia—and so hard for anyone else to touch. Tuning GPU performance is a gnarly problem. You can’t just conscript some tender-footed undergrad on Market Street, hand them a Claude Max plan, and expect them to hack GPU kernels. Writing at this level is a grindsome enterprise—unless you’re a crackerjack programmer at DeepSeek.

A disclosure: In previous Machine Readable columns, I was already familiar with the languages I was analyzing. Not so here. In the interest of maintaining this standard, I decided to spend a day with CUDA. It ruined my afternoon.

A simple matrix multiplication that usually takes me three lines in PyTorch—a popular machine-learning framework—took me 50-plus lines in CUDA. Wringing out the last drop of performance, it turns out, is an admirable but tedious business. Having dipped my toe in the moat, I can report that it is indeed deep and forbidding.
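For contrast, here is what the framework-level version looks like — NumPy standing in for PyTorch so the sketch is self-contained, since in both the multiplication itself is a single `@` — with comments marking the bookkeeping a hand-written CUDA version must spell out:

```python
import numpy as np

# Two modest matrices; in PyTorch this would be torch.rand instead.
a = np.random.rand(64, 64).astype(np.float32)
b = np.random.rand(64, 64).astype(np.float32)

c = a @ b  # one line; the framework dispatches an optimized kernel for you

# The 50-plus-line CUDA version has to do by hand what that one line hides:
#   - allocate device buffers for a, b, and c (cudaMalloc)
#   - copy the inputs from host to device (cudaMemcpy)
#   - define a kernel and launch it over a grid of thread blocks,
#     each block computing one tile of c
#   - synchronize, copy c back to the host, and free every buffer

assert c.shape == (64, 64)
```

This is the asymmetry the column describes: the framework's one-liner rides on CUDA's libraries underneath, which is precisely why leaving them behind is so laborious.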

CUDA’s dominance is built not just on the quality of its ecosystem but on a lock-in effect. Because modern machine-learning frameworks are built on CUDA, which runs only on Nvidia chips, AMD’s chips underperform even when they have more cores and memory. Comparing chips by spec sheets is like comparing race cars by cylinder count, whereas real performance can only be measured on the track.

A second disclosure: I intended to benchmark two chips, but there was no way to expense an Nvidia H100 and an AMD MI300X without landing on Condé Nast’s blacklist. Instead, you will have to take the word of independent researchers who found that even with better specs on paper, AMD was outmatched by Nvidia.

Nvidia’s edge in software might be that, unusual for a chip company, it hires more software engineers than hardware engineers. If I were running AMD, I…
