Chat is Dead: How JSON Prompting Cut My AI Costs by 73%

I burned $2,400 in 3 weeks talking to AI like a human.

For 18 months, I built AI features the “normal” way: conversational prompts, friendly instructions, “please” and “thank you” sprinkled in. It worked, until our user base 10x’d in January. Our monthly OpenAI bill went from $800 to $4,100. Same features. Same prompts. Just more conversations.

That’s when I discovered JSON prompting. Not as a nice-to-have. As a survival requirement. Three weeks after migrating our entire stack, our bill dropped to $1,107. A 73% reduction. Here’s the exact system.

Why chat interfaces are a tax on engineering

Traditional prompting looks like this:

const prompt = `
  You're a helpful assistant. Please extract the user's name, email, and company from this text. 
  Be polite and return the data in a friendly format. 
  Text: ${userInput}
`;

The problems:

  • Unpredictable output: Sometimes JSON, sometimes markdown, sometimes an apology.
  • Token bloat: “Please,” “helpful,” “friendly” = 12 wasted tokens per call.
  • Parser hell: JSON.parse() fails 23% of the time (my actual metric).
  • No schema validation: You find out it’s broken in production.

When you’re doing 500K calls/month, those 12 tokens become 6M tokens. At $0.03/1K tokens, that’s $180/month for the word “please.”
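
The arithmetic above is easy to rerun with your own volume and pricing. A minimal sketch, assuming a flat per-1K-token rate (real pricing usually differs for input and output tokens); the function name `wastedTokenCost` is made up for illustration:

```javascript
// Rough monthly cost of a fixed number of unnecessary prompt tokens per call.
// The inputs below are the figures quoted in the text; the flat rate is a simplification.
function wastedTokenCost(callsPerMonth, wastedTokensPerCall, dollarsPer1kTokens) {
  const wastedTokens = callsPerMonth * wastedTokensPerCall;
  const dollars = (wastedTokens / 1000) * dollarsPer1kTokens;
  // Round to cents so floating-point noise does not leak into the result.
  return { wastedTokens, dollars: Math.round(dollars * 100) / 100 };
}

const overhead = wastedTokenCost(500_000, 12, 0.03);
console.log(overhead); // { wastedTokens: 6000000, dollars: 180 }
```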

JSON prompting: treating LLMs like APIs

Here’s the same task with JSON prompting:

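import OpenAI from "openai";

// The client reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();

// The prompt is a plain object: a schema, one terse instruction, and the input.
const prompt = {
  schema: {
    name: "string",
    email: "string (valid format)",
    company: "string"
  },
  instructions: "Extract from text. Return ONLY valid JSON.",
  text: userInput
};

const response = await openai.chat.completions.create({
  model: "gpt-4-turbo",
  // JSON mode requires the word "JSON" somewhere in the messages;
  // the instructions field above takes care of that.
  messages: [{ role: "user", content: JSON.stringify(prompt) }],
  response_format: { type: "json_object" }
});

// The reply arrives as a JSON string in the first choice.
const extracted = JSON.parse(response.choices[0].message.content);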

Result: 100% parse success rate. Zero fluff. 34% fewer tokens.
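
One caveat worth covering in code: `response_format: { type: "json_object" }` makes the reply parse as JSON, but it does not check that the fields match your schema. That is the "no schema validation" problem from the list earlier. A minimal sketch of a check for the name/email/company task; `validateExtraction` and the loose email pattern are assumptions for illustration, not part of any library:

```javascript
// Hypothetical validator for the name/email/company extraction.
// The email pattern is a deliberately loose sanity check, not an RFC-grade validator.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function validateExtraction(rawJson) {
  let data;
  try {
    data = JSON.parse(rawJson);
  } catch (err) {
    return { ok: false, errors: ["response is not valid JSON"] };
  }
  if (data === null || typeof data !== "object" || Array.isArray(data)) {
    return { ok: false, errors: ["response is not a JSON object"] };
  }

  const errors = [];
  for (const field of ["name", "email", "company"]) {
    if (typeof data[field] !== "string" || data[field].trim() === "") {
      errors.push(`missing or empty field: ${field}`);
    }
  }
  // Only check the format when an email string is actually present.
  const hasEmail = typeof data.email === "string" && data.email.trim() !== "";
  if (hasEmail && !EMAIL_PATTERN.test(data.email)) {
    errors.push("email is not in a valid format");
  }
  return errors.length === 0 ? { ok: true, data } : { ok: false, errors };
}
```

With the OpenAI Node SDK, the string to pass in would be `response.choices[0].message.content`. Failing fast here means a malformed record surfaces as a list of errors instead of a crash further down the pipeline.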

The hidden reason it saves 73% (it’s not the tokens)

Everyone focuses on token reduction. That’s the small win. The big win is eliminating retry loops. With chat prompting, my flow looked like:

  1. Send prompt → get markdown instead of JSON
  2. Retry with “return ONLY JSON” → get JSON with comments
  3. Retry again → finally get clean JSON
  4. Parse → crash on edge case
  5. Add try/catch → retry entire flow

Average: 2.7 API calls per successful extraction.

With JSON prompting + response_format: send prompt → get guaranteed JSON → parse → works. Average: 1.0 calls. That’s a 63% reduction in API calls before any token savings. Combined with schema efficiency: 73% total cost cut.
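
To see how the two savings compound, here is a back-of-the-envelope model. It assumes cost per task is simply calls per task times tokens per call at a flat token price, which is a simplification: retries do not always resend the full prompt, and input and output tokens are priced differently. The function name is made up for illustration; the inputs are the before/after figures reported in this post (2.7 vs. 1.0 calls, 1,240 vs. 820 tokens per call).

```javascript
// Relative cost per task = (calls per task) x (tokens per call), at a flat token price.
function relativeCostReduction(before, after) {
  const costBefore = before.callsPerTask * before.tokensPerCall;
  const costAfter = after.callsPerTask * after.tokensPerCall;
  return 1 - costAfter / costBefore;
}

const reduction = relativeCostReduction(
  { callsPerTask: 2.7, tokensPerCall: 1240 },
  { callsPerTask: 1.0, tokensPerCall: 820 }
);
console.log(`${(reduction * 100).toFixed(1)}% estimated reduction`); // 75.5% estimated reduction
```

The simple model lands near 75%, in the same range as the 73% measured on the actual bill. The small gap is what you would expect from costs that do not scale exactly with calls times tokens.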

The reasoning token trap

Here’s what nobody tells you about “thinking models” (o1, Claude 3.7, Gemini 2.0): when you enable reasoning, you’re billed for the model’s internal thoughts, typically as output tokens, even though you never see them. I pasted 500K tokens of codebase for analysis. The model used 187K “reasoning tokens” to think about it. My bill: $18.40 for thinking, $15 for the answer. JSON prompting narrows what the model has to deliberate about. Instead of “thinking” in open-ended prose, it maps the input to your schema. My reasoning token usage dropped 81%.
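
Before optimizing, it helps to measure how much of a response is reasoning. A minimal sketch, assuming the usage object shape of the OpenAI Chat Completions API, where reasoning tokens appear under `completion_tokens_details.reasoning_tokens` and are counted inside `completion_tokens`. The function name, the sample numbers, and the price argument are placeholders, not real pricing.

```javascript
// Splits the output-side usage of one response into reasoning vs. visible answer.
// `usage` follows the OpenAI Chat Completions usage shape (an assumption about your provider);
// `dollarsPer1mOutputTokens` is a placeholder, so look up your model's current price.
function reasoningCostBreakdown(usage, dollarsPer1mOutputTokens) {
  const reasoningTokens = usage.completion_tokens_details?.reasoning_tokens ?? 0;
  const answerTokens = usage.completion_tokens - reasoningTokens;
  const toDollars = (tokens) => (tokens / 1_000_000) * dollarsPer1mOutputTokens;
  return {
    reasoningTokens,
    answerTokens,
    reasoningDollars: toDollars(reasoningTokens),
    answerDollars: toDollars(answerTokens),
    reasoningShare:
      usage.completion_tokens === 0 ? 0 : reasoningTokens / usage.completion_tokens,
  };
}

// Made-up sample usage, only to show the shape of the result.
const sampleUsage = {
  prompt_tokens: 1200,
  completion_tokens: 1000,
  completion_tokens_details: { reasoning_tokens: 750 },
};
console.log(reasoningCostBreakdown(sampleUsage, 10).reasoningShare); // 0.75
```

Logging `reasoningShare` per endpoint before and after a migration is one way to check whether a tighter schema actually reduces reasoning on your workload.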

Migration: 3 files changed

  • Step 1: Define schemas (schemas.js)
  • Step 2: Create prompt builder (prompt.js)
  • Step 3: Update API calls (set temperature: 0 to minimize run-to-run variation)
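
The three steps can be sketched in a few lines. This is an illustration under assumptions, not the original files: the `schemas` registry, the `contactExtraction` task name, and `buildRequest` are hypothetical names, and both "files" are shown in one block for brevity.

```javascript
// schemas.js (sketch): one entry per task. Task names and fields are illustrative.
const schemas = {
  contactExtraction: {
    schema: { name: "string", email: "string (valid format)", company: "string" },
    instructions: "Extract from text. Return ONLY valid JSON.",
  },
};

// prompt.js (sketch): turns a task name plus input text into request parameters.
function buildRequest(taskName, text, model = "gpt-4-turbo") {
  const task = schemas[taskName];
  if (!task) {
    throw new Error(`Unknown task: ${taskName}`);
  }
  const prompt = { schema: task.schema, instructions: task.instructions, text };
  return {
    model,
    messages: [{ role: "user", content: JSON.stringify(prompt) }],
    response_format: { type: "json_object" },
    // temperature 0 reduces variation between runs; it is not a hard guarantee.
    temperature: 0,
  };
}
```

Step 3 then shrinks to passing the result to the client at each call site, for example `openai.chat.completions.create(buildRequest("contactExtraction", userInput))`. Keeping every schema in one registry also means a prompt change is a one-file diff.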

The results after 21 days

Metric            Before    After     Change
Avg tokens/call   1,240     820       -34%
Parse failures    23%       0%        -100%
Avg calls/task    2.7       1.0       -63%
Monthly cost      $4,100    $1,107    -73%
P95 latency       2.3s      1.1s      -52%

Bonus: Our error rate dropped from 1.2% to 0.03%. Support tickets about “AI acting weird” went to zero.

When NOT to use JSON prompting

  • Creative writing (you want the fluff)
  • Exploratory analysis (you want reasoning prose)
  • Customer-facing chat (humans like “please”)

For everything else (data extraction, classification, transformation, API-like tasks), JSON prompting is highly effective.

The stack is deterministic

The era of “prompt engineering as conversation” is shifting. We are entering a phase where prompt engineering functions more like API design. Your prompts are schemas. Your LLMs are functions. Your costs are predictable. Start with one endpoint this week and measure the before/after.