This article introduces ScarfBench, a benchmark for evaluating how well AI agents migrate enterprise Java applications from one framework to another. The study highlights the complexity of framework migration and proposes a standardized method for assessing agent capabilities in this domain.

  • ScarfBench provides a comprehensive dataset of real-world enterprise Java codebases for testing migration accuracy.
  • It reports metrics such as code correctness, performance retention, and development effort reduction.
  • The benchmark covers several popular Java frameworks, among them Spring Boot, Jakarta EE, and Micronaut.
  • Evaluation results show significant variation in AI agent performance across different framework pairs.
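
To make the idea of per-pair variation concrete, the following is a minimal sketch of how migration outcomes might be grouped by source and target framework. It is not ScarfBench's actual scoring code: the record layout, the `RUNS` data, and the function name `pass_rate_by_pair` are all hypothetical, and the values are placeholders rather than results from the study.

```python
from collections import defaultdict

# Hypothetical records: (source framework, target framework, migration passed?).
# These values are placeholders for illustration, not results from ScarfBench.
RUNS = [
    ("Spring Boot", "Jakarta EE", True),
    ("Spring Boot", "Jakarta EE", False),
    ("Jakarta EE", "Micronaut", True),
    ("Jakarta EE", "Micronaut", True),
    ("Micronaut", "Spring Boot", False),
]


def pass_rate_by_pair(runs):
    """Return {(source, target): fraction of runs that passed}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for source, target, ok in runs:
        totals[(source, target)] += 1
        if ok:
            passed[(source, target)] += 1
    return {pair: passed[pair] / totals[pair] for pair in totals}


if __name__ == "__main__":
    # Print one line per framework pair, sorted for stable output.
    for (source, target), rate in sorted(pass_rate_by_pair(RUNS).items()):
        print(f"{source} -> {target}: {rate:.0%}")
```

Keying the aggregation on the ordered pair, rather than on a single framework, keeps a migration in one direction separate from the reverse direction, which is where differences across framework pairs would show up.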

The authors argue that ScarfBench is essential for guiding the development of more reliable AI tools for enterprise software modernization.