I put a datacenter GPU in my gaming PC

I already had an RTX 4080. 16GB of VRAM. Good enough for gaming, not good enough for the models I wanted to run locally. The next step up in GPU land is either to spend a fortune on a card with more VRAM or to find another way. I found another way.

I bought a datacenter GPU that doesn’t even have a normal PCIe connector, stuck it in my gaming PC with an adapter, and now I have 32GB of VRAM across two GPUs running a 27-billion-parameter model at 32 tokens per second. The whole thing cost me about £200.

The GPU

This is a Tesla V100 SXM2 16GB. It was designed for NVIDIA’s DGX servers and hyperscaler racks. The SXM2 form factor means it does not have a PCIe edge connector. It does not have display outputs. It does not have a normal power connector. It sits on a proprietary board inside a server and communicates over NVLink.

You cannot plug this into a motherboard. Not without help. But here is the thing: this is a Volta GPU with 16GB of HBM2 memory, 5120 CUDA cores, and I picked it up for about £150 on eBay. The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.

HBM2 is a different class of memory. The V100 has a 4096-bit memory bus delivering 900 GB/s of bandwidth. To put that in perspective, my RTX 4080 with its fancy GDDR6X manages about 717 GB/s. The V100 from 2017 has roughly 26% more memory bandwidth than a GPU that launched in 2022.
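
Those headline figures fall out of simple arithmetic: bus width in bytes times the per-pin data rate. Here is a minimal sketch, assuming per-pin rates taken from public spec sheets (about 1.75 Gbit/s for the V100's HBM2, 22.4 Gbit/s for the original RTX 4080 and 23 Gbit/s for the 4080 Super). The rates are assumptions for illustration, not something measured on this build.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, gbit_per_pin: float) -> float:
    """Theoretical peak bandwidth in GB/s: bytes per transfer times per-pin rate."""
    return bus_width_bits / 8 * gbit_per_pin


# (bus width in bits, assumed per-pin data rate in Gbit/s)
cards = {
    "Tesla V100 SXM2 (HBM2)": (4096, 1.754),   # 877 MHz memory clock, double data rate
    "RTX 4080 (GDDR6X)": (256, 22.4),
    "RTX 4080 Super (GDDR6X)": (256, 23.0),
}

for name, (bits, rate) in cards.items():
    print(f"{name}: {peak_bandwidth_gb_s(bits, rate):.0f} GB/s")
```

The very wide bus is what lets HBM2 win here: each pin is more than ten times slower than GDDR6X, but there are sixteen times as many of them.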

And it is not just NVIDIA’s consumer cards that lose. Apple’s M3 Max does up to 400 GB/s. The M4 Max does up to 546 GB/s. The brand new M5 Max, which will set you back over £3,000 for a laptop, manages 614 GB/s. A GPU from 2017 beats every Mac on the market.

The closest AMD competition to my 4080 is the RX 7900 XTX, which does 960 GB/s on its 24GB of GDDR6. Technically that edges out the V100, but the 7900 XTX costs £700+ and ROCm support for LLM inference is still rough compared to CUDA. The V100 gives you 94% of that bandwidth for less than a quarter of the price, and it just works with llama.cpp.

The only consumer GPU that comfortably beats it is the RTX 5090 at 1,792 GB/s, and that card costs over £2,000. For LLM inference, where memory bandwidth is usually the bottleneck that determines your tokens per second, this matters more than almost anything else.
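
To see why bandwidth sets the pace, here is a rough back-of-the-envelope sketch. It assumes every weight is read from VRAM once per generated token and ignores compute, KV cache reads and overhead, so the result is a ceiling, not a prediction. The 8 GB model size is a hypothetical example, not the model used here.

```python
def decode_ceiling_tok_s(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if all weights are streamed once per token."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 8.0  # hypothetical quantized model that fits on a single 16GB card

for name, bandwidth in [
    ("M4 Max", 546.0),
    ("Tesla V100", 900.0),
    ("RTX 5090", 1792.0),
]:
    ceiling = decode_ceiling_tok_s(MODEL_GB, bandwidth)
    print(f"{name}: at most {ceiling:.0f} tokens/s")
```

Real throughput lands below these numbers, but the ordering between cards tends to follow the bandwidth column.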

The adapter

Turns out, someone makes an SXM2-to-PCIe adapter. It is not made by NVIDIA. It is not officially supported by anyone. It is a bare PCB with the SXM2 socket on one side and a PCIe edge connector on the other. I paid about £50 for it. Half of that might just be the copper.

So for about £200 total, I had a 16GB GPU that could slot into my motherboard alongside my RTX 4080. That is 32GB of total VRAM. A single RTX 5090 with 32GB costs over £2,000. I am not saying this is the same experience. I am saying the amount of VRAM is the same.

The fan from hell

Before I could do anything useful with the V100, I had to deal with the fan. The V100 SXM2 was designed to live inside a 2U server with industrial cooling. The fan on the adapter is not subtle. It is not quiet. It is not something you want in a room you also sleep in.

I measured it with my Apple Watch: 82 decibels. That is somewhere between a garbage disposal and a lawnmower, well past “loud PC” and into “should I be wearing earplugs in my own house” territory.

And the worst part: you cannot control it. I tried nvidia-smi, I tried scanning for it on Linux, I even tried Afterburner on Windows. Nothing. The fan on this adapter is not designed to be controlled. It is designed to run at 100%, forever, inside a server rack where nobody has to hear it.

Making the fan listen to reason

A quick 9V battery test told me the pinout was standard case fan territory, just with a weird connector. The next question was whether the fan would actually respond to PWM control if I wired its tachometer and PWM pins to my motherboard.

It works. The motherboard can read the RPM and the fan responds to PWM. I keep it at 10%. The GPU never goes above 50°C even at full load, and I cannot really hear the fan. Now I just needed a proper cable instead of jumper wires held in by hope.
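
If you would rather set the duty cycle from Linux than from a BIOS fan curve, the kernel's hwmon sysfs interface exposes each controllable header as a `pwmN` file that takes a value from 0 to 255. This is a minimal sketch under that assumption, not how this build is necessarily configured. The hwmon directory and channel number in the example are hypothetical and differ from board to board, and writing to them needs root.

```python
from pathlib import Path


def duty_to_pwm(percent: float) -> int:
    """Convert a duty cycle in percent to the 0-255 value hwmon expects."""
    if not 0 <= percent <= 100:
        raise ValueError("duty cycle must be between 0 and 100")
    return round(percent * 255 / 100)


def set_fan_duty(hwmon_dir: Path, channel: int, percent: float) -> None:
    """Switch one fan header to manual mode, then write its duty cycle."""
    (hwmon_dir / f"pwm{channel}_enable").write_text("1\n")  # 1 = manual control
    (hwmon_dir / f"pwm{channel}").write_text(f"{duty_to_pwm(percent)}\n")


# Example call (hypothetical path and channel, check your own board first):
# set_fan_duty(Path("/sys/class/hwmon/hwmon2"), channel=3, percent=10)
```

Keep an eye on GPU temperature under load after lowering the duty cycle, since nothing here raises the fan speed again if the card heats up.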

The solution was a 2.54mm male to PH2.0 female jumper cable. The female PH2.0 end plugs into the fan’s tachometer and PWM pins, and the male 2.54mm end goes into a spare fan header on the motherboard. That took the fan from 82 dB of ear damage to something I can actually live with.

Doubling VRAM for cheap

With the fan situation handled, the V100 slotted right in alongside my 4080:

  • RTX 4080: 16GB VRAM, Ada architecture

  • Tesla V100: 16GB VRAM, Volta architecture

  • Total: 32GB VRAM across two GPUs

llama.cpp can split the model across both GPUs using tensor splitting. It pipelines the layers across the PCIe bus so the 4080 handles some layers and the V100 handles the rest. It is not as fast as having a single GPU with 32GB, but it works, and it cost me roughly 10% of what a 32GB GPU would cost.
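
For anyone trying the same thing, here is a hedged sketch of what the launch command might look like, built in Python so the split ratio is easy to change. It uses llama.cpp's `-ngl`, `--split-mode` and `--tensor-split` options. The model filename and the even 1:1 ratio are assumptions for illustration, not necessarily the settings used on this machine.

```python
def llama_server_args(model_path: str, split: tuple[float, float]) -> list[str]:
    """Build a llama-server command line that spreads layers over two GPUs."""
    ratio = ",".join(f"{share:g}" for share in split)
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",             # offload every layer to the GPUs
        "--split-mode", "layer",  # whole layers per GPU, processed one GPU after the other
        "--tensor-split", ratio,  # proportion of the model placed on each GPU
    ]


def weights_per_gpu(model_gb: float, split: tuple[float, float]) -> list[float]:
    """Approximate weight footprint per GPU in GB for a given split ratio."""
    total = sum(split)
    return [model_gb * share / total for share in split]


# Hypothetical model file, split evenly between the two 16GB cards.
args = llama_server_args("model-27b-q4.gguf", (1, 1))
print(" ".join(args))
print(weights_per_gpu(16.0, (1, 1)))
```

Shifting the ratio, for example to `3,2`, is one way to leave headroom on whichever card also drives the display.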