ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration

Modernizing enterprise applications is one of the largest and most expensive software engineering activities organizations undertake. Teams migrate applications across frameworks to improve maintainability, cloud readiness, and developer productivity, and to gain access to modern capabilities. Recent advances in coding agents have sparked excitement around AI-assisted modernization. But an important question remains: can AI agents reliably modernize real-world enterprise applications?

Existing software engineering benchmarks have demonstrated impressive progress in bug fixing and code generation, but framework migration presents a fundamentally different challenge. Success requires not only translating code, but also preserving behavior, adapting build systems, and navigating runtime dependencies. To address this gap, we introduce ScarfBench (Self-Contained Application Refactoring Benchmark), an open benchmark for evaluating AI agents on cross-framework migration tasks in Enterprise Java.

ScarfBench focuses on migrations across three major Java ecosystems: Spring, Jakarta EE, and Quarkus. Unlike traditional benchmarks that compare generated code against reference implementations, ScarfBench evaluates whether migrated applications actually build, deploy, and preserve behavior.

Why Migration Is Hard

Framework migration is much more than replacing annotations. Even migrating a simple repository can require changes across dependency injection, persistence configuration, queries, and framework descriptors. A small mistake in any of these pieces can prevent successful deployment.

Introducing ScarfBench

ScarfBench provides a systematic way to evaluate AI agents on enterprise Java framework migration tasks. A migrated application must build successfully, deploy correctly, and pass behavioral validation. This provides a much more realistic measure of modernization quality than build success alone.
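The three requirements can be read as a sequence of gates, where a later check only matters if the earlier ones pass. The following is a minimal sketch of that idea, not ScarfBench's actual harness: the class, enum, and method names are hypothetical, and the three checks are passed in as plain boolean suppliers standing in for a real build, deployment, and test run.

```java
import java.util.function.BooleanSupplier;

class StagedEvaluation {
    // Hypothetical outcome labels; the post only names the three requirements.
    enum Outcome { BUILD_FAILED, DEPLOY_FAILED, BEHAVIOR_FAILED, PASSED }

    // Run the checks in order and stop at the first one that fails,
    // so a later stage is never evaluated for an application that
    // did not clear the earlier stage.
    static Outcome evaluate(BooleanSupplier build,
                            BooleanSupplier deploy,
                            BooleanSupplier behaviorTests) {
        if (!build.getAsBoolean()) return Outcome.BUILD_FAILED;
        if (!deploy.getAsBoolean()) return Outcome.DEPLOY_FAILED;
        if (!behaviorTests.getAsBoolean()) return Outcome.BEHAVIOR_FAILED;
        return Outcome.PASSED;
    }

    public static void main(String[] args) {
        // Stand-in checks: the application builds and deploys,
        // but its behavioral tests fail.
        Outcome outcome = evaluate(() -> true, () -> true, () -> false);
        System.out.println(outcome); // prints BEHAVIOR_FAILED
    }
}
```

Under this gating, an application that merely compiles is reported as a failure at a later stage rather than as a success.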

Benchmark at a Glance

Applications: 34
Framework implementations: 102
Migration tasks: 204
Lines of code: ~151K
Source and test files: ~2,000
Expert-written tests: 1,331

ScarfBench includes both focused migration tasks and whole-application migrations. Starting from a JSR-based enterprise Java taxonomy, expert migrations create verified implementations across Spring, Jakarta EE, and Quarkus.
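The overview counts are consistent with each application being implemented in all three frameworks and each implementation being migrated to each of the other two. That reading is an inference from the numbers rather than something stated here, and the class and method names below are hypothetical. The sketch assumes Java 16 or later for records.

```java
import java.util.ArrayList;
import java.util.List;

class MigrationTaskCount {
    // A directed migration from one framework to another.
    record Pair(String source, String target) {}

    // Enumerate every ordered (source, target) pair of distinct frameworks.
    static List<Pair> orderedPairs(List<String> frameworks) {
        List<Pair> pairs = new ArrayList<>();
        for (String source : frameworks) {
            for (String target : frameworks) {
                if (!source.equals(target)) {
                    pairs.add(new Pair(source, target));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> frameworks = List.of("Spring", "Jakarta EE", "Quarkus");
        int applications = 34; // from the overview

        // Assumption: one implementation per application per framework.
        int implementations = applications * frameworks.size();
        // Assumption: one task per application per directed framework pair.
        int tasks = applications * orderedPairs(frameworks).size();

        System.out.println("implementations = " + implementations); // 102
        System.out.println("migration tasks = " + tasks);           // 204
    }
}
```

Three frameworks give six directed pairs, so 34 applications yield 102 implementations and 204 tasks, matching the figures in the overview.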

How Do Frontier Agents Perform?

We evaluated several state-of-the-art coding agents on ScarfBench. Despite strong performance on traditional software engineering benchmarks, framework migration remains difficult. Success rates vary considerably across framework pairs, and whole-application migrations remain particularly challenging.

Even the strongest current agents achieve less than 10% behavioral success, illustrating the gap between generating compilable code and preserving application behavior. Compile success consistently exceeds deploy success, which in turn exceeds behavioral success. Build success alone significantly overestimates migration quality.

What We Learned About AI Agents for Java Modernization

Beyond measuring success rates, ScarfBench helps us understand how agents behave during modernization.

Finding: Agents Are Overconfident

Claude Code reported successful builds for 29 out of 30 whole applications, but only 22 of those 29 actually built. Meanwhile, the single application the agent classified as failed ultimately built correctly. This suggests that agent self-assessment should not be treated as a reliable signal of migration completion. Independent build and test validation remains essential.
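The reported figures can be arranged as a small agreement table between what the agent claimed and what independent validation found. The sketch below only recombines the numbers given in this finding; the class, record, and method names are hypothetical, and it assumes the 22 successful builds are a subset of the 29 claimed successes. It requires Java 16 or later for records.

```java
import java.util.Locale;

class SelfReportCheck {
    // Counts of applications by (agent claim, verified build result).
    record Tally(int claimedAndBuilt, int claimedButFailed,
                 int deniedButBuilt, int deniedAndFailed) {

        int total() {
            return claimedAndBuilt + claimedButFailed + deniedButBuilt + deniedAndFailed;
        }

        // Applications that really built, regardless of the agent's claim.
        int actuallyBuilt() {
            return claimedAndBuilt + deniedButBuilt;
        }

        // Share of applications where the claim matched the verified result.
        double agreement() {
            return (double) (claimedAndBuilt + deniedAndFailed) / total();
        }

        // Share of "build succeeded" claims that were true.
        double claimPrecision() {
            return (double) claimedAndBuilt / (claimedAndBuilt + claimedButFailed);
        }
    }

    public static void main(String[] args) {
        // 29 claimed successes, of which 22 built (so 7 did not);
        // 1 claimed failure, which in fact built.
        Tally t = new Tally(22, 7, 1, 0);

        System.out.println("actually built: " + t.actuallyBuilt() + " of " + t.total());
        System.out.println(String.format(Locale.ROOT,
                "claim matched reality: %.1f%%", 100 * t.agreement()));
        System.out.println(String.format(Locale.ROOT,
                "true success claims: %.1f%%", 100 * t.claimPrecision()));
    }
}
```

On these numbers the agent's claim matched the verified result for 22 of 30 applications (about 73.3%), and 22 of its 29 success claims (about 75.9%) were true.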

Finding: Migration Is Iterative Rather Than Linear

The layers agents visited most frequently were Configuration, Web, Database, and Service. This suggests that migration is an iterative dependency-resolution process rather than a simple source-to-source transformation.

Finding: Configuration Dominates Migration Effort

Rather than proceeding linearly, agents repeatedly returned to configuration-related artifacts while resolving framework differences and dependency issues.
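One way to picture "visits per layer" is to classify each file an agent touches and count the touches. The sketch below is illustrative only: the path-based heuristic, the layer enum, and the six-entry trace are all hypothetical, since this post does not describe how layers were assigned or publish agent traces.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class LayerVisitTally {
    enum Layer { CONFIGURATION, WEB, DATABASE, SERVICE, OTHER }

    // Hypothetical heuristic: guess a layer from the file path alone.
    static Layer classify(String path) {
        String p = path.toLowerCase(Locale.ROOT);
        if (p.endsWith("pom.xml") || p.endsWith(".properties")
                || p.endsWith(".yaml") || p.endsWith("persistence.xml")) {
            return Layer.CONFIGURATION;
        }
        if (p.contains("/web/") || p.contains("/controller/") || p.contains("/rest/")) {
            return Layer.WEB;
        }
        if (p.contains("/repository/") || p.contains("/entity/")) {
            return Layer.DATABASE;
        }
        if (p.contains("/service/")) {
            return Layer.SERVICE;
        }
        return Layer.OTHER;
    }

    // Count one visit per entry; repeated paths count repeatedly.
    static Map<Layer, Integer> tally(List<String> visitedPaths) {
        Map<Layer, Integer> counts = new EnumMap<>(Layer.class);
        for (String path : visitedPaths) {
            counts.merge(classify(path), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up trace for illustration, not benchmark data.
        List<String> trace = List.of(
                "pom.xml",
                "src/main/java/app/service/OrderService.java",
                "pom.xml",
                "src/main/resources/application.properties",
                "src/main/java/app/web/OrderResource.java",
                "pom.xml");
        System.out.println(tally(trace)); // {CONFIGURATION=4, WEB=1, SERVICE=1}
    }
}
```

In this toy trace the build file is revisited after each source change, which is the kind of non-linear pattern the finding describes.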

Finding: Environment and Tooling Matter

Agents frequently struggled with environmental issues, including Docker cache inconsistencies, port connectivity problems, and problems with the Maven wrapper and other build tooling.