Why I Migrated From GPT-4o to DeepSeek — A Backend Engineer's Notes
Six months ago, my monthly OpenAI bill crossed four figures and I finally snapped. Not because the cost was unbearable in absolute terms, but because I had a sneaking suspicion I was overpaying for marginal quality gains. So I did what any sane backend engineer would do: I instrumented my service to log token usage by endpoint, spun up parallel calls to every major Chinese model, and started comparing numbers like my paycheck depended on it. Spoiler — it kind of did.
This is the story of what I found when I actually ran Chinese AI models (DeepSeek, Qwen, Kimi, GLM) head-to-head against the US incumbents (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) on a real production workload. Not a synthetic benchmark, not a vibes-based Twitter thread — actual requests flowing through my service. Fwiw, the results were not what I expected.
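The per-endpoint instrumentation mentioned above can be approximated with a small in-memory accumulator. This is a minimal sketch, not the author's actual code: `UsageLog`, `UsageTotals`, the `/classify` endpoint name, and the token counts are all hypothetical. In a real handler the counts would come from the `usage` field of a chat completion response.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageTotals:
    calls: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0


class UsageLog:
    """Accumulates token counts keyed by (endpoint, model)."""

    def __init__(self) -> None:
        self._totals = defaultdict(UsageTotals)

    def record(self, endpoint: str, model: str,
               prompt_tokens: int, completion_tokens: int) -> None:
        totals = self._totals[(endpoint, model)]
        totals.calls += 1
        totals.prompt_tokens += prompt_tokens
        totals.completion_tokens += completion_tokens

    def totals(self, endpoint: str, model: str) -> UsageTotals:
        return self._totals[(endpoint, model)]


usage_log = UsageLog()
# Sample values only. With the OpenAI SDK these would be
# response.usage.prompt_tokens and response.usage.completion_tokens.
usage_log.record("/classify", "gpt-4o", prompt_tokens=412, completion_tokens=38)
usage_log.record("/classify", "gpt-4o", prompt_tokens=390, completion_tokens=41)
print(usage_log.totals("/classify", "gpt-4o"))
```

Keying on both endpoint and model is what makes a later side-by-side comparison possible, since the same endpoint can be served by different models during a trial.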
The Pricing Problem Nobody Wants to Talk About
Let’s start with the part CFOs care about. The price gap between US and Chinese models in 2026 isn’t a rounding error — it’s a yawning chasm. Here’s what I’m currently paying (or would pay) per million tokens:
| Model | Origin | Input $/M | Output $/M | Output price vs DeepSeek V4 Flash |
|---|---|---|---|---|
| DeepSeek V4 Flash | 🇨🇳 | $0.18 | $0.25 | 1× (baseline) |
| Qwen3-32B | 🇨🇳 | $0.18 | $0.28 | 1.1× |
| GPT-4o-mini | 🇺🇸 | $0.15 | $0.60 | 2.4× |
| GLM-5 | 🇨🇳 | $0.73 | $1.92 | 7.7× |
| Kimi K2.5 | 🇨🇳 | $0.59 | $3.00 | 12× |
| Gemini 1.5 Pro | 🇺🇸 | $1.25 | $5.00 | 20× |
| GPT-4o | 🇺🇸 | $2.50 | $10.00 | 40× |
| Claude 3.5 Sonnet | 🇺🇸 | $3.00 | $15.00 | 60× |
Sixty times. Let that marinate. Claude 3.5 Sonnet’s output pricing is 60× more than DeepSeek V4 Flash. For my workload — heavy on short-to-medium classification and extraction calls — that’s the difference between $40/month and $2,400/month. Same corpus, same prompts, same downstream business logic. The knee-jerk reaction is “yeah but you get what you pay for.” Does that hold up? Let me show you the numbers.
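To make the arithmetic concrete, here is a rough cost calculator built on the two extreme rows of the pricing table. It is a sketch under stated assumptions: the token volumes are hypothetical, and `monthly_cost` and `cost_ratio` are illustrative names rather than anything from my service.

```python
# USD per million tokens as (input, output), copied from the pricing table above.
PRICES = {
    "deepseek-v4-flash": (0.18, 0.25),
    "claude-3.5-sonnet": (3.00, 15.00),
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Bill in USD for a month's worth of input and output tokens."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def cost_ratio(input_tokens: int, output_tokens: int) -> float:
    """How many times more Claude 3.5 Sonnet costs than DeepSeek V4 Flash."""
    expensive = monthly_cost("claude-3.5-sonnet", input_tokens, output_tokens)
    cheap = monthly_cost("deepseek-v4-flash", input_tokens, output_tokens)
    return expensive / cheap


# Hypothetical volumes: an output-only month, then an input-heavy mix.
print(monthly_cost("claude-3.5-sonnet", 0, 160_000_000))
print(monthly_cost("deepseek-v4-flash", 0, 160_000_000))
print(cost_ratio(300_000_000, 100_000_000))
```

At 160M output tokens the two bills come out to $2,400 and $40, which matches the figures above. The 60× figure is the output-price ratio, though. With a hypothetical 300M input / 100M output mix the blended ratio drops to roughly 30×, because the input-price gap is closer to 17×. Plug in your own logged volumes before quoting a multiplier to anyone.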
Benchmark Numbers, Because Vibes Don’t Ship to Production
I pulled community-average scores for the three categories I care about as a backend engineer: general reasoning (MMLU-style), code generation (HumanEval), and Chinese-language performance (C-Eval). These are approximate — your mileage will absolutely vary based on prompt format, temperature, and whether you remembered to escape your JSON properly. Imo, they paint a clear picture regardless.
General Reasoning
| Model | MMLU-style Score | Output $/M |
|---|---|---|
| Claude 3.5 Sonnet | 89.0 | $15.00 |
| GPT-4o | 88.7 | $10.00 |
| Qwen3.5-397B | 87.5 | $2.34 |
| Kimi K2.5 | 87.0 | $3.00 |
| GLM-5 | 86.0 | $1.92 |
| DeepSeek V4 Flash | 85.5 | $0.25 |
The spread between the best and worst here is about 3.5 points. That’s not nothing, but it’s also not 60× of anything. My read is that most of these models are converging on the same training-data-plus-RLHF plateau, and the remaining differences come down to fine-tuning specifics rather than fundamental capability gaps.
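One way to see the mismatch is to put the score gap and the price multiple side by side. The snippet below is a small sketch that uses only the numbers from the table above; `compare_to_baseline` is a hypothetical helper name.

```python
# (MMLU-style score, output USD per million tokens) from the table above.
GENERAL_REASONING = {
    "Claude 3.5 Sonnet": (89.0, 15.00),
    "GPT-4o": (88.7, 10.00),
    "Qwen3.5-397B": (87.5, 2.34),
    "Kimi K2.5": (87.0, 3.00),
    "GLM-5": (86.0, 1.92),
    "DeepSeek V4 Flash": (85.5, 0.25),
}


def compare_to_baseline(table: dict, baseline: str) -> list:
    """Return (model, score gap in points, output price multiple) per model."""
    base_score, base_price = table[baseline]
    rows = []
    for model, (score, price) in table.items():
        gap = round(score - base_score, 1)
        multiple = round(price / base_price, 1)
        rows.append((model, gap, multiple))
    return rows


for model, gap, multiple in compare_to_baseline(GENERAL_REASONING, "DeepSeek V4 Flash"):
    print(f"{model:<20} {gap:+.1f} pts  {multiple:.1f}x output price")
```

Whether a few points are worth the multiple depends on the workload. For classification and extraction I could not find a case where they were, but a task with a long tail of hard reasoning might land differently.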
Code Generation (HumanEval)
| Model | Score | Output $/M |
|---|---|---|
| Claude 3.5 Sonnet | 93.0 | $15.00 |
| GPT-4o | 92.5 | $10.00 |
| DeepSeek V4 Flash | 92.0 | $0.25 |
| Qwen3-Coder-30B | 91.5 | $0.35 |
| DeepSeek Coder | 91.0 | $0.25 |
This is the section that made me audibly laugh when I first saw it. DeepSeek V4 Flash scores within one point of GPT-4o on HumanEval while charging 40× less for output tokens. And the specialized DeepSeek Coder variant — built specifically for this task — is a hair behind at 91.0 for the same $0.25/M. If you’re not using these for code-adjacent workloads, you’re leaving real money on the table.
Chinese Language (C-Eval)
| Model | Score | Output $/M |
|---|---|---|
| GLM-5 | 91.0 | $1.92 |
| Kimi K2.5 | 90.5 | $3.00 |
| Qwen3-32B | 89.0 | $0.28 |
| GPT-4o | 88.5 | $10.00 |
| DeepSeek V4 Flash | 88.0 | $0.25 |
Shocking absolutely no one, models trained on Chinese corpora perform better on Chinese-language evaluations. GLM-5 and Kimi K2.5 top this list, with Qwen3-32B punching far above its weight at $0.28/M and edging out GPT-4o. Even DeepSeek V4 Flash, which is positioned as a generalist, lands within half a point of GPT-4o on C-Eval for 40× less money.
The Real Moat: Access, Not Quality
Here’s where I have to get real for a second. Picking Chinese models based on benchmarks alone is easy. Actually deploying them? That’s where the friction lives. The obstacles aren’t technical — they’re commercial and regulatory:
| Concern | US Models | Chinese Direct | Global API |
|---|---|---|---|
| Payment | Credit card ✅ | WeChat/Alipay ❌ | PayPal + cards ✅ |
| Signup | Email ✅ | Chinese phone # ❌ | Email ✅ |
| Wire format | OpenAI-compatible ✅ | Custom per provider ❌ | OpenAI-compatible ✅ |
| Geo-restrictions | None ✅ | Often blocked ❌ | None ✅ |
| Docs language | English ✅ | Mostly Chinese ❌ | English ✅ |
| Support | English ✅ | Chinese ❌ | Both ✅ |
| Currency | USD ✅ | CNY only ❌ | USD ✅ |
The primary barrier to Chinese models in 2026 isn’t model quality; that’s basically a solved problem. It’s the sheer operational overhead of getting an account, getting verified, paying the bill, and then dealing with N different SDK quirks from N different providers. Under the hood, most Chinese providers don’t even speak the same wire format, which means you’d need to maintain N client implementations. That’s why I ended up routing everything through Global API: it gives me OpenAI-compatible endpoints, USD billing, and PayPal support, which means I can A/B test providers without touching my application code.
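Once every provider sits behind one endpoint, an A/B test reduces to choosing a model name per request. Below is a minimal sketch of a deterministic split, assuming each request carries a stable ID. `pick_model`, the ID format, and the 20% share are hypothetical, and nothing here is specific to Global API.

```python
import hashlib


def pick_model(request_id: str, candidate: str, control: str,
               candidate_share: float) -> str:
    """Route a fixed share of requests to the candidate model.

    Hashing the request ID keeps the assignment stable across retries,
    so the same request always lands on the same model.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return candidate if bucket < candidate_share * 10_000 else control


# Send roughly 20% of traffic to the cheaper model and keep the rest on GPT-4o.
for request_id in ("req-1001", "req-1002", "req-1003"):
    print(request_id, pick_model(request_id, "deepseek-v4-flash", "gpt-4o", 0.2))
```

A hash-based split avoids storing any assignment state, and raising the share later only moves additional requests onto the candidate without reshuffling the ones already there.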
Code Example: The Drop-In Replacement
Here’s the beautiful thing about OpenAI-compatible APIs. Switching providers is literally a one-line config change in most codebases. Here’s a simplified version of what my service looks like:
```python
import json
import os

from openai import OpenAI

# The endpoint is OpenAI-compatible, so the stock SDK only needs a new base_url.
client = OpenAI(
    api_key=os.getenv("GLOBAL_API_KEY"),
    base_url="https://global-apis.com/v1",
)


def classify_ticket(text: str) -> dict:
    response = client.chat.completions.create(
        model="deepseek-v4-flash",  # swap to gpt-4o, claude-3.5-sonnet, etc.
        messages=[
            {"role": "system", "content": "Classify the support ticket. Return JSON."},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    # message.content is a JSON string; parse it so the return value is a dict.
    return json.loads(response.choices[0].message.content)
```
I run the exact same code path against gpt-4o, deep…
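Running one code path against several models and comparing the answers can be sketched as a small harness. Everything here is illustrative: `compare_models`, `agreement`, the `category` key, and the stand-in `fake_classify` function are assumptions, and the model names are placeholders rather than my actual list. In practice the `classify` callable would wrap the chat completion call and return the raw JSON string.

```python
import json
from typing import Callable


def compare_models(tickets: list, models: list,
                   classify: Callable[[str, str], str]) -> dict:
    """Run every ticket through every model and collect the predicted labels.

    classify(model, text) must return the model's raw JSON string.
    Unparseable or non-object output is recorded as None instead of raising.
    """
    results = {}
    for model in models:
        labels = []
        for text in tickets:
            try:
                labels.append(json.loads(classify(model, text)).get("category"))
            except (json.JSONDecodeError, AttributeError):
                labels.append(None)
        results[model] = labels
    return results


def agreement(results: dict, reference: str, other: str) -> float:
    """Fraction of tickets where both models produced the same valid label."""
    pairs = list(zip(results[reference], results[other]))
    return sum(a == b and a is not None for a, b in pairs) / len(pairs)


def fake_classify(model: str, text: str) -> str:
    """Offline stand-in for a real API call, used only for this demo."""
    if model == "model-b" and "refund" in text:
        return "not json"
    is_billing = "invoice" in text or "refund" in text
    return json.dumps({"category": "billing" if is_billing else "other"})


tickets = ["invoice is wrong", "refund please", "app crashes"]
results = compare_models(tickets, ["model-a", "model-b"], fake_classify)
print(results)
print(agreement(results, "model-a", "model-b"))
```

Counting malformed output as a disagreement is a deliberate choice: a cheaper model that returns broken JSON costs a retry, so it should show up in the comparison rather than be silently dropped.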