Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

Abstract: Large Language Model (LLM) agents demonstrate strong performance in autonomous code generation under loose specifications. However, production-grade software requires strict adherence to structural constraints, such as architectural patterns, databases, and object-relational mappings. Existing benchmarks often overlook these non-functional requirements, rewarding functionally correct but structurally arbitrary solutions.

We present a systematic study evaluating how well agents handle structural constraints in multi-file backend generation. By fixing a unified API contract across 80 greenfield generation tasks and 20 feature-implementation tasks spanning eight web frameworks, we isolate the effect of structural complexity using a dual evaluation with end-to-end behavioral tests and static verifiers.
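
To make the idea of a static verifier concrete, the following is a minimal sketch of one such check, not the study's actual tooling. It assumes a Python backend and a task that requires a particular ORM package; the function names `imported_modules` and `check_orm_constraint`, and the choice of `sqlalchemy` and `sqlite3` in the usage lines, are hypothetical and serve only as illustration.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Only absolute imports are counted; relative imports are skipped.
            modules.add(node.module.split(".")[0])
    return modules


def check_orm_constraint(source: str, required: str, forbidden: set[str]) -> list[str]:
    """Hypothetical structural check: the required package must be imported
    and none of the forbidden modules may be imported."""
    found = imported_modules(source)
    violations = []
    if required not in found:
        violations.append(f"missing required import: {required}")
    for name in sorted(found & forbidden):
        violations.append(f"forbidden import: {name}")
    return violations


# Usage: a file that works functionally but bypasses the required ORM.
sample = "import sqlite3\nfrom flask import Flask\n"
print(check_orm_constraint(sample, "sqlalchemy", {"sqlite3"}))
# ['missing required import: sqlalchemy', 'forbidden import: sqlite3']
```

A check of this kind says nothing about behavior, which is why it would be paired with end-to-end tests: a solution can pass every behavioral assertion and still fail the structural check, or the reverse.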

Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance declines substantially. From baseline to fully specified tasks, capable configurations lose 30 percentage points on average in assertion pass rate, while some weaker configurations approach a pass rate of zero.
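
As a rough illustration of how such a drop can be measured, the sketch below computes an assertion pass rate for two constraint levels and reports the difference in percentage points. The helper names and the assertion counts are made up for the example and are not results from the study.

```python
def pass_rate(passed: int, total: int) -> float:
    """Assertion pass rate as a percentage."""
    if total == 0:
        raise ValueError("no assertions were run")
    return 100.0 * passed / total


def decay_points(baseline_rate: float, constrained_rate: float) -> float:
    """Drop in pass rate, in percentage points, from baseline to constrained."""
    return baseline_rate - constrained_rate


# Made-up counts for one hypothetical configuration (not data from the study).
baseline = pass_rate(40, 50)         # loose specification
fully_specified = pass_rate(28, 50)  # all structural constraints applied
print(decay_points(baseline, fully_specified))  # 24.0
```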

Framework sensitivity analysis exposes significant performance disparities: agents succeed in minimal, explicit frameworks (e.g., Flask) but perform substantially worse on average in convention-heavy environments (e.g., FastAPI, Django). Finally, error analysis identifies data-layer defects (e.g., incorrect query composition and ORM runtime violations) as the leading root causes. This work highlights that jointly satisfying functional and structural requirements remains a key open challenge for coding agents.
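
For readers unfamiliar with what an incorrect query composition can look like, here is one hypothetical instance, written against Python's standard `sqlite3` module rather than any ORM from the study. The table, rows, and helper `find_ids` are invented for illustration. The defect is an operator-precedence slip: in SQL, `AND` binds more tightly than `OR`, so combining filters without parentheses changes which rows match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, tag TEXT, published INTEGER)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [(1, "ann", "py", 1), (2, "ann", "go", 0), (3, "bob", "py", 1)],
)


def find_ids(where: str, params: tuple) -> list[int]:
    """Run a SELECT with a fixed, code-supplied WHERE clause and bound values."""
    rows = conn.execute(f"SELECT id FROM posts WHERE {where} ORDER BY id", params)
    return [row[0] for row in rows]


# Intent: published posts that are by "ann" or tagged "py".
# Buggy: parsed as author = ? OR (tag = ? AND published = 1),
# so the unpublished post 2 is returned as well.
buggy = find_ids("author = ? OR tag = ? AND published = 1", ("ann", "py"))
# Fixed: parentheses make the OR group explicit.
fixed = find_ids("(author = ? OR tag = ?) AND published = 1", ("ann", "py"))

print(buggy)  # [1, 2, 3]
print(fixed)  # [1, 3]
```

Both queries execute without error, so a defect like this surfaces only when a behavioral test asserts on the returned rows.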

框架敏感性分析揭示了显著的性能差异:智能体在极简、显式的框架(如 Flask)中表现良好,但在惯例驱动(convention-heavy)的环境(如 FastAPI、Django)中平均表现明显较差。最后,错误分析指出数据层缺陷(如错误的查询组合和 ORM 运行时违规)是导致失败的主要根源。这项工作强调,同时满足功能性和结构性需求仍然是代码生成智能体面临的一个关键挑战。