Six Choices Every AI Engineer Has to Make (and Nobody Teaches)

The production trade-offs that only appear once your model is live.

University courses teach you how to make a model accurate. They rarely teach you the decisions that come right after. How do you know when to fully automate something versus keep a human in the loop? When does prompting stop being enough and fine-tuning become worth the cost? What does it actually mean to pick real-time inference over batch when the bill arrives? These questions don’t show up in coursework. They show up in your first week in production.

This article walks through 6 trade-offs that show up in production AI work. Each is backed by recent research, so you get a glimpse of how people are dealing with them. There are no right answers here. There are useful frames, real numbers, and the kind of context that makes the next decision faster.

Index

Build vs. Buy in the LLM Era (When calling an API stops making sense)
Model Complexity vs. Maintainability (Who debugs this in 6 months?)
Data Quantity vs. Data Quality (More data isn’t always the answer)
Throughput vs. Latency (Batch or real-time)
Prompt Engineering vs. Fine-Tuning (Two very different investment curves)
Automation vs. Human Oversight (How much do you trust the model to act alone?)

1. Build vs. Buy in the LLM Era: When calling an API stops making sense

The old version of this question was: do we train our own model? That one is mostly settled. Almost nobody trains from scratch anymore. The 2026 version is harder. You have 3 options now: call an API, fine-tune an open-source model, or build and host your own stack. Each has a very different cost curve and very different failure modes.

A 2025 Omdia survey of 376 technical and business stakeholders found that 95% agreed building gives more customization and control [1]. The same survey found that 91% agreed prebuilt platforms ship faster. Both numbers are true at the same time, which is the problem.

It gets concrete at scale. Below 100k daily requests, calling an API like GPT-4o Mini is usually the right call. Low overhead. Fast iteration. Above 1M daily requests, per-token costs start eating margin [2]. Here is the part teams undervalue. A 2024 analysis found that hardware and electricity make up only 20 to 30% of self-hosting cost. Staff is the other 70 to 80% [2]. This means that most build-vs-buy spreadsheets account for the GPUs and forget the engineers.
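The infrastructure-versus-staff split can be turned into a rough break-even estimate. The following is a minimal sketch, not a pricing model: the class and function names are made up for illustration, the per-token price and monthly infrastructure spend are hypothetical inputs you would replace with your own, and the only figure taken from the text is the 20 to 30% infrastructure share (0.25 is used as a midpoint).

```python
from dataclasses import dataclass


@dataclass
class ApiPricing:
    # Hypothetical blended price per 1,000 tokens in USD (not a real quote).
    usd_per_1k_tokens: float


@dataclass
class SelfHostCost:
    # Hypothetical monthly spend on hardware and electricity, in USD.
    infra_usd_per_month: float
    # Fraction of total self-hosting cost that is infrastructure.
    # The article cites 20 to 30%; staff is the remainder.
    infra_share: float = 0.25

    def total_usd_per_month(self) -> float:
        # If infrastructure is `infra_share` of the total,
        # then total = infrastructure / infra_share.
        return self.infra_usd_per_month / self.infra_share


def api_usd_per_month(pricing, daily_requests, tokens_per_request, days=30):
    """Monthly API bill for a given request volume."""
    tokens = daily_requests * tokens_per_request * days
    return tokens / 1000 * pricing.usd_per_1k_tokens


def break_even_daily_requests(pricing, self_host, tokens_per_request, days=30):
    """Daily request volume at which the API bill equals self-hosting cost."""
    usd_per_request = tokens_per_request / 1000 * pricing.usd_per_1k_tokens
    return self_host.total_usd_per_month() / (usd_per_request * days)


if __name__ == "__main__":
    pricing = ApiPricing(usd_per_1k_tokens=0.001)          # made-up price
    host = SelfHostCost(infra_usd_per_month=5000.0)        # made-up spend
    print(round(break_even_daily_requests(pricing, host, 1000)))
```

With these made-up inputs the break-even lands at roughly 667k daily requests. Setting `infra_share=1.0`, which is what a spreadsheet that counts only GPUs effectively does, moves it to roughly 167k, a quarter of the volume actually needed to justify the switch.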

Another study found that teams exceeded their LLM cost budgets by 340% on average. In most cases the cause was missing per-tenant usage tracking and missing query-level cost attribution, not the per-token rate itself [3]. Teams couldn’t see which feature or prompt was burning the budget, so they couldn’t fix it.

Framework lock-in shows up later and shows up hard. Hugging Face’s Text Generation Inference went into maintenance mode in late 2025, and teams who built on it had to migrate. Teams who used an API didn’t have to do anything. The practical frame I use: start with the API. Instrument every call with cost, latency, and feature attribution from day 1. Switch when the math forces you to.
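Instrumenting every call with cost, latency, and feature attribution can start very small. The sketch below is one possible shape, not a reference implementation: `CostLedger`, `CallRecord`, and `fake_llm` are hypothetical names, the prices are placeholders, and the wrapped client is assumed to return the reply text together with prompt and completion token counts (real client libraries expose usage differently, so the adapter around your client is left to you).

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CallRecord:
    tenant: str
    feature: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float
    cost_usd: float


@dataclass
class CostLedger:
    # Placeholder prices per 1,000 tokens; substitute your provider's rates.
    usd_per_1k_prompt: float
    usd_per_1k_completion: float
    records: list = field(default_factory=list)

    def track(self, tenant, feature, call, *args, **kwargs):
        """Run `call` and record who used it, for what, and at what cost.

        Assumption: `call` returns (text, prompt_tokens, completion_tokens).
        """
        start = time.perf_counter()
        text, p_tok, c_tok = call(*args, **kwargs)
        latency = time.perf_counter() - start
        cost = (p_tok / 1000 * self.usd_per_1k_prompt
                + c_tok / 1000 * self.usd_per_1k_completion)
        self.records.append(
            CallRecord(tenant, feature, p_tok, c_tok, latency, cost))
        return text

    def cost_by(self, key):
        """Total cost grouped by a record field, e.g. 'tenant' or 'feature'."""
        totals = defaultdict(float)
        for record in self.records:
            totals[getattr(record, key)] += record.cost_usd
        return dict(totals)


def fake_llm(prompt):
    # Stand-in for a real client: counts whitespace-separated words as
    # prompt tokens and pretends every reply is 5 tokens long.
    return "stub reply", len(prompt.split()), 5
```

A call then looks like `ledger.track("acme", "search", fake_llm, "some prompt")`, and `ledger.cost_by("feature")` or `ledger.cost_by("tenant")` answers the question the over-budget teams could not: which feature or tenant is burning the money.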

2. Model Complexity vs. Maintainability: Who debugs this in 6 months?

A famous Google paper introduced the CACE principle: Changing Anything Changes Everything [4]. In ML systems, a small tweak in one part of the pipeline can trigger surprising changes elsewhere. This rarely happens with a linear regression. It happens often with ensembles and neural nets. Research on ML technical debt shows that data dependencies cost more than code dependencies [4].

Why? Because data is harder to track, harder to version, and harder to explain to whoever inherits the system 6 months from now. The original paper estimated that the actual model code is a small fraction of a real-world ML system. The bulk is feature stores, pipeline logic, monitoring, retraining triggers, and the glue between all of them [5].

In practice, teams pick a more complex model for a 2% accuracy gain and pay for that choice for 18 months in debugging time, retraining overhead, and the “nobody remembers why we did this” tax. The question to ask before shipping a complex model is: who owns this in a year? If the honest answer is “unclear,” that is the decision point.

3. Data Quantity vs. Data Quality: More data isn’t always the answer

More data wins for foundation models trained on internet-scale corpora. In applied ML, the relationship breaks down much sooner. Research shows that beyond a noise threshold, adding more low-quality data flattens or degrades model performance [6]. Past that point, a larger sample no longer buys accuracy.

At companies, this shows up as the “data swamp” problem. Teams collect everything because storage is cheap and they assume it will be useful one day. Without governance, you get a pool that takes weeks to clean, raises storage and pipeline costs, and slows experimentation without improving outcomes [7].

Medical AI is the clearest case. Small datasets with expert-verified labels have repeatedly outperformed larger datasets with unreliable annotations. The model learned the right patterns from less data because the signal was clean. The question I find more useful in practice: how noisy is what we have, and what does 1 more hour of cleaning buy us versus 1 more day of collection?
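One way to put a number on "how noisy is what we have" is to have an expert re-label a small random sample and compare it with the existing labels. The sketch below assumes such an audit sample exists; the function names are hypothetical, and the Wilson score interval is a choice made here for small samples, not something the cited research prescribes.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


def audit_label_noise(original_labels, expert_labels):
    """Estimate the label error rate from an expert-relabeled sample.

    Returns (observed_rate, lower_bound, upper_bound). The sample should
    be drawn at random from the full dataset for the bounds to mean much.
    """
    if len(original_labels) != len(expert_labels):
        raise ValueError("label lists must have the same length")
    n = len(original_labels)
    errors = sum(o != e for o, e in zip(original_labels, expert_labels))
    low, high = wilson_interval(errors, n)
    return errors / n, low, high
```

If the upper bound is already low, more cleaning is unlikely to pay off and collection is the better use of time. If even the lower bound is high, an hour of relabeling is probably worth more than another day of gathering data with the same noise.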

4. Throughput vs. Latency: Batch or Real-Time

Batch and real-time inference are 2 different system architectures. Picking the wrong one…