xAI is looking more like a datacentre REIT than a frontier lab

An unexpected development over the past few weeks is xAI’s new partnerships with Anthropic and Google, which give both companies a huge amount of compute capacity. It’s worth remembering that xAI is now part of SpaceX, after the two merged back in February, so the revenue from these deals flows straight into the entity about to go public. While much has been made of the potential for financial engineering ahead of SpaceX’s upcoming IPO, I think there’s more to this than pure accounting tricks.

Anthropic was in a serious bind. If you use Claude products much, you’ll (very probably) be aware that Anthropic has had serious capacity problems, especially from early afternoon onwards in Europe and in the mornings in the US. This is when demand seems to be highest, as users in Europe and the Americas are at work at the same time and fighting for capacity. I’ve written about this compute crunch a few times before: the coming crunch, whether it’s here yet, and what comes next. It led Anthropic to introduce new peak-hour restrictions on its subscriptions, with usage between 5am–11am PT / 1pm–7pm GMT counting for more of your usage limit. The aim was to shift demand from peak hours to off-peak hours, when more capacity was available. However, there is only so much demand shifting you can do when demand is growing as fast as Anthropic’s. At some point you end up having to ration users further, which is far from ideal when you have both Google and OpenAI breathing down your neck for customers.

xAI to the rescue? At the start of May, xAI announced a partnership with Anthropic to provide access to its (older) Colossus 1 datacentre in Memphis. This allowed Anthropic to reverse the usage limit restrictions on its subscriptions. While the stability of Anthropic’s services still leaves a lot to be desired, the peak-time crunch has abated (for now, at least). The fees involved are enormous, ramping to $1.25bn/month for 300MW of capacity, or approximately 220k GPUs. Last week, Google announced a similar partnership: $920mn/month for 110k GPUs. It’s important to note that both agreements have cancellation clauses, allowing either party to cancel with 90 days’ notice after an initial lock-in period. If you take this at face value, it is a ludicrously profitable deal for xAI. The headline fees don’t account for opex and depreciation, but if the deals continue for 18 months, xAI recoups all the capex it spent and still has many hundreds of MW of GPUs available. With the giant compute shortage likely to persist into the medium term, older H100s should still be extremely useful 18 months out.
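
To put the headline numbers on a common footing, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above ($1.25bn/month for roughly 220k GPUs and 300MW, and $920mn/month for 110k GPUs). The 730-hour month, the function names and the idea of treating the full "ramped" rate as constant are my own simplifying assumptions, so the results are upper bounds on revenue rather than a forecast. No capex figure is assumed: the payback helper takes it as an input.

```python
# Back-of-the-envelope deal arithmetic using only figures quoted in the post.
# Assumption: fees are paid at the full ramped rate for the whole period.

HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def per_gpu_hour(monthly_fee_usd: float, gpus: int) -> float:
    """Implied rental price in USD per GPU-hour."""
    return monthly_fee_usd / gpus / HOURS_PER_MONTH


def payback_months(capex_usd: float, monthly_revenue_usd: float) -> float:
    """Months of gross revenue needed to cover a given capex (ignores opex)."""
    return capex_usd / monthly_revenue_usd


ANTHROPIC_FEE = 1.25e9   # USD per month, from the post
ANTHROPIC_GPUS = 220_000
ANTHROPIC_MW = 300
GOOGLE_FEE = 920e6       # USD per month, from the post
GOOGLE_GPUS = 110_000

anthropic_rate = per_gpu_hour(ANTHROPIC_FEE, ANTHROPIC_GPUS)
google_rate = per_gpu_hour(GOOGLE_FEE, GOOGLE_GPUS)
kw_per_gpu = ANTHROPIC_MW * 1000 / ANTHROPIC_GPUS  # facility power per GPU
gross_18_months = (ANTHROPIC_FEE + GOOGLE_FEE) * 18

print(f"Anthropic deal: ${anthropic_rate:.2f} per GPU-hour")   # about $7.78
print(f"Google deal:    ${google_rate:.2f} per GPU-hour")      # about $11.46
print(f"Power budget:   {kw_per_gpu:.2f} kW per GPU")          # about 1.36
print(f"18-month gross: ${gross_18_months / 1e9:.2f}bn")       # $39.06bn
```

On these assumptions the two deals together gross about $39bn over 18 months before opex and depreciation. Whether that clears the capex depends on a build cost the post does not state, which is what `payback_months` is left open for.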

The case against: There are certainly some red flags with these deals. Firstly, Elon Musk and OpenAI were/are locked in a bitter legal battle, and the Anthropic deal could be motivated more by a desire to put pressure on OpenAI than by commercial reality. Secondly, Google is a major shareholder in SpaceX, so it certainly has an incentive to juice the valuation at IPO. While I’m sure there is some degree (potentially a lot!) of truth in these viewpoints, GPUs at this scale are in enormously short supply. One of the untold stories of the datacentre capex boom is just how far behind schedule the projects are. Even OpenAI’s flagship Stargate UAE datacentre, which is being built in a jurisdiction renowned for a laissez-faire attitude to building regulations, is now under direct threat from the current Iran conflict, with Iranian drones having already hit other UAE datacentres. In comparison, SpaceX/xAI are incredible at building datacentres on time. The original Colossus 1 datacentre was built in 122 days. Musk’s empire has a huge advantage in really understanding how to plan, build and execute enormous infrastructure projects quickly. The hyperscalers no doubt have the experience to do this, but their datacentres have historically been built with far less urgency, with typical project execution taking many years. Given that capex only really started to ramp up in the last couple of years, many of these projects are still years away. This gives xAI a serious competitive advantage that, in my opinion, shouldn’t just be hand-waved away.

But what about Grok? There is no doubt this leaves Grok in an odd spot, with a lot of the datacentre capacity that was destined for Grok training and inference now being leased to direct competitors. While it’s foolish to write off any model provider, it certainly looks like a serious retreat from Grok vying to be a frontier-class lab. But perhaps they over-specified their datacentre capacity. Inference demand for Grok models is likely to be seriously behind projections, leaving a bunch of spare capacity which might as well be monetised while the training lottery continues. It’s hard to say, and the xAI & Cursor deal muddies the water even further. On balance, I think all three things are true to some degree. There’s no doubt some level of financial engineering going on. There’s also an enormous compute shortage. And it seems to me SpaceX/xAI does have a real competitive advantage in datacentre buildout. How true each of these turns out to be will define the success or failure of the biggest IPO in North American history. Either way, the more I look at it, the more xAI is starting to resemble a datacentre REIT with a frontier lab attached, rather than the other way around.
