Twelve Ways to Be Wrong About AI-Assisted Coding

Suppose your manager asks you next week to demonstrate that the AI coding tools your company signed up for are worth the subscription cost. Would you measure lines of code generated, or tickets closed? Or would you send out a survey asking whether developers feel more productive? Each of those approaches is flawed in a different way; the sections below explain why.

Note: this post is about how people are assessing AI, not about LLM-assisted coding itself; with a little rewording, these criticisms could be applied to many of the claims that have been made about agile development, test-driven development, and other practices. If I’ve learned anything in the last twenty years, it’s that software engineering would be a lot further ahead today if we had been willing to let our peers in the human sciences teach us how to study these kinds of things properly.

Counting Lines of Code Generated

Proxy metrics stand in for concepts that are hard to measure directly, and lines of code is one of the oldest. LLMs generate more code, but more code does not necessarily mean better outcomes: a team that sees a 40% increase in lines of code per developer after adopting LLM tools has measured verbosity, not productivity. Deleting 2000 lines of tangled logic and replacing them with 200 clean ones is an improvement that looks like a loss on this metric [Sadowski2019]. More code also means more to read, maintain, and debug, and AI’s contribution to that future burden does not appear in the line count.
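
To see how the metric misreads the refactoring described above, here is a minimal sketch. The function name `net_lines_changed` and the second, verbose change are my own assumptions for illustration; only the 2000-to-200 refactoring comes from the paragraph above.

```python
def net_lines_changed(lines_added: int, lines_deleted: int) -> int:
    """Return the naive 'lines of code produced' figure for one change."""
    return lines_added - lines_deleted


# The refactoring from the text: 2000 tangled lines replaced by 200 clean ones.
refactor_score = net_lines_changed(lines_added=200, lines_deleted=2000)

# A hypothetical change that only piles on 2000 new lines (invented for contrast).
verbose_score = net_lines_changed(lines_added=2000, lines_deleted=0)

print(refactor_score)  # -1800: the improvement is scored as a loss
print(verbose_score)   # 2000: the extra maintenance burden is scored as a win
```

Nothing in either number says whether the code got better, which is the whole problem with using it as a proxy.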

Timing Artificial Tasks

A widely cited study found that developers who used GitHub Copilot completed a task 55% faster than those who did not [Peng2023]. The task was implementing an HTTP server in JavaScript from scratch, in ninety minutes; the developers had no other obligations that day. Real software development involves navigating a large codebase you did not write, understanding a requirement described ambiguously in a ticket, coordinating with colleagues, and attending meetings. Speed on a greenfield toy task does not predict speed on any of that. A randomized controlled trial with experienced open-source developers found the opposite of what participants themselves predicted: giving them access to AI tools increased task completion time by 19% [Becker2025].

Before/After With No Control Group

You start using LLMs in January; by June, pull requests are shipping faster, so the tools must be working, right? But between January and June you hired twelve engineers, refactored the CI pipeline, and switched your cloud provider. Without a group that did not adopt the tools, you cannot separate the effect of LLMs from any of the other changes that happened at the same time. Internal validity requires a credible counterfactual, i.e., some way of knowing what would have happened otherwise.
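
One common way to use a comparison group is a difference-in-differences estimate: subtract the change seen by non-adopters from the change seen by adopters. The sketch below is only an illustration under stated assumptions. All cycle times are invented, `diff_in_diff` is a hypothetical helper, and the estimate is only credible if both groups would otherwise have followed the same trend.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Invented pull-request cycle times in hours (not real data).
adopters_jan = [50, 46, 54]
adopters_jun = [38, 42, 40]
others_jan = [49, 51, 50]   # teams that did not adopt the tools
others_jun = [41, 39, 40]

# Before/after on adopters alone credits the tools with the whole change.
naive_change = mean(adopters_jun) - mean(adopters_jan)

# With a comparison group, the shared improvement (new hires, CI, cloud) cancels out.
did_estimate = diff_in_diff(adopters_jan, adopters_jun, others_jan, others_jun)

print(naive_change)  # -10
print(did_estimate)  # 0
```

In this invented data, everyone got ten hours faster, so the improvement attributable to the tools is zero even though the before/after comparison looks impressive.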

Asking Developers If They Feel More Productive

Survey results like “87% of developers report feeling more productive with AI tools” are regularly cited as evidence that the tools work [Liang2024], but three things make self-report systematically misleading. The Hawthorne effect means people work differently when they know they are being observed and evaluated. The novelty effect means new tools feel faster because they are novel, and that feeling typically fades within weeks. Social desirability bias means respondents tend to say what they believe the survey wants to hear, especially when management chose the tool.

Counting Commits, Pull Requests, and Tickets

In 2023, McKinsey proposed measuring individual developer productivity using counts of commits, pull requests, code reviews, and similar activities [McKinsey2023]. Goodhart’s Law states that when a measure becomes a target, it ceases to be a good measure [Goodhart1984]. When developers know their commit count is tracked, they make more, smaller commits; when ticket counts are tracked, tickets get split. The numbers improve while the underlying work does not [Beck2023]. Activity is not output; output is not value.
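
To make the gaming concrete, here is a minimal sketch. `split_commit` is a hypothetical helper and the 300-line change is an invented figure; the point is only that the tracked count can rise while the work stays the same.

```python
def split_commit(changed_lines: int, pieces: int) -> list[int]:
    """Split one change into `pieces` commits without changing the total work."""
    base, extra = divmod(changed_lines, pieces)
    return [base + (1 if i < extra else 0) for i in range(pieces)]


honest_commits = [300]                # one commit, 300 changed lines (invented)
gamed_commits = split_commit(300, 6)  # same work, sliced to inflate the count

print(len(honest_commits), sum(honest_commits))  # 1 300
print(len(gamed_commits), sum(gamed_commits))    # 6 300
```

The commit count goes up sixfold; the amount of work delivered does not move.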

Measuring Only the Easy Half

LLMs make code generation faster, and that half is easy to measure. The other half is harder: time spent reviewing LLM-generated code for correctness, time lost debugging confidently wrong suggestions, security vulnerabilities introduced by plausible-looking but insecure code, and technical debt from suggestions that solved the immediate problem while ignoring the surrounding design. A study of GitHub Copilot’s code found that a substantial fraction of generated code contained security vulnerabilities, and that developers under time pressure accepted insecure suggestions at higher rates [Pearce2022]. A 2025 evaluation of five major LLMs found that none produced web application code meeting industry security standards [Dora2025]. A large-scale analysis of over 300,000 AI-authored commits found that more than 15% introduce at least one quality issue, and nearly a quarter of those issues persist in the codebase long-term [Liu2026]. Measuring only the gains while ignoring the costs that rise alongside them is not measurement; it is marketing.
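
The accounting behind this argument can be sketched in a few lines. Everything here is an assumption for illustration: the `TaskTimes` class, its field names, and the minute values are invented, and real costs such as security fixes or technical debt are much harder to put a number on.

```python
from dataclasses import dataclass


@dataclass
class TaskTimes:
    """Hypothetical per-task minutes; every value is invented for illustration."""
    writing_saved: float     # minutes saved by generating code (the easy half)
    review_added: float      # extra minutes reviewing generated code
    debugging_added: float   # extra minutes chasing confidently wrong suggestions

    def net_saved(self) -> float:
        """Minutes saved once the downstream costs are subtracted."""
        return self.writing_saved - self.review_added - self.debugging_added


task = TaskTimes(writing_saved=30, review_added=20, debugging_added=25)

print(task.writing_saved)  # 30: what a generation-speed metric reports
print(task.net_saved())    # -15: what the developer actually experienced
```

A dashboard that shows only the first number would report a win for a task that, in this invented case, took longer overall.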

Treating Adoption Rate as a Success Metric

“We have achieved 90% AI tool adoption across engineering” is a procurement outcome, not a productivity outcome. Adoption measures whether the tool is installed and opened; it says nothing about whether suggestions are useful, whether developers accept them thoughtlessly, or whether the accepted suggestions are correct. High adoption combined with low suggestion quality produces a workforce spending time managing a tool rather than benefiting from one. A study of IBM’s enterprise AI coding assistant found that while the tool often provided net productivity increases, those…