The author introduces Surface Evolver Bench, a custom benchmark designed to evaluate how well large language models can write complex physical simulations with the Surface Evolver tool. Surface Evolver, released in 1992, models liquid surfaces. Each problem is described in a custom datafile that the user writes, containing vertices, edges, faces, bodies, constraints, energies, and boundary integrals.
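To give a sense of what such a datafile involves, here is a minimal sketch that builds the classic unit-cube geometry in the vertices/edges/faces/bodies layout and checks that its oriented edge loops are consistent. The helper names (`oriented`, `face_is_closed_loop`, `surface_is_closed`, `render_datafile`) are hypothetical and are not part of Surface Evolver or of the benchmark. Constraints, energies, and boundary integrals are left out, so this covers only the simplest part of what the benchmark tasks require.

```python
# Hypothetical sketch: a unit cube in Surface Evolver's datafile layout.
# Faces are oriented edge loops; a negative edge id means "traverse backwards".

VERTICES = {
    1: (0.0, 0.0, 0.0), 2: (1.0, 0.0, 0.0), 3: (1.0, 1.0, 0.0), 4: (0.0, 1.0, 0.0),
    5: (0.0, 0.0, 1.0), 6: (1.0, 0.0, 1.0), 7: (1.0, 1.0, 1.0), 8: (0.0, 1.0, 1.0),
}
EDGES = {
    1: (1, 2), 2: (2, 3), 3: (3, 4), 4: (4, 1),
    5: (5, 6), 6: (6, 7), 7: (7, 8), 8: (8, 5),
    9: (1, 5), 10: (2, 6), 11: (3, 7), 12: (4, 8),
}
FACES = {
    1: (1, 10, -5, -9), 2: (2, 11, -6, -10), 3: (3, 12, -7, -11),
    4: (4, 9, -8, -12), 5: (5, 6, 7, 8), 6: (-4, -3, -2, -1),
}


def oriented(edge_id):
    """Return (tail, head) vertex ids for a signed edge id."""
    tail, head = EDGES[abs(edge_id)]
    return (tail, head) if edge_id > 0 else (head, tail)


def face_is_closed_loop(loop):
    """Each edge must end where the next one starts, wrapping around."""
    ends = [oriented(e) for e in loop]
    return all(ends[i][1] == ends[(i + 1) % len(ends)][0] for i in range(len(ends)))


def surface_is_closed():
    """On a closed surface every edge is used once in each direction."""
    used = [e for loop in FACES.values() for e in loop]
    return all(used.count(k) == 1 and used.count(-k) == 1 for k in EDGES)


def render_datafile():
    """Serialize the geometry as datafile text with one body of target volume 1."""
    lines = ["// unit cube, illustrative only", "vertices"]
    lines += [f"{i} {x} {y} {z}" for i, (x, y, z) in VERTICES.items()]
    lines += ["", "edges"]
    lines += [f"{i} {a} {b}" for i, (a, b) in EDGES.items()]
    lines += ["", "faces"]
    lines += [f"{i} " + " ".join(str(e) for e in loop) for i, loop in FACES.items()]
    lines += ["", "bodies", "1 " + " ".join(str(i) for i in FACES) + " volume 1"]
    return "\n".join(lines) + "\n"
```

A sign error in a single face loop is enough to break the orientation check, which is the kind of mistake the debugging step of the benchmark has to catch.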

  • gpt5.5 is identified as the best model overall and is the only model to solve several of the tasks.
  • glm5.2 is noted as the best open-source model for this benchmark.
  • The benchmark utilizes a natural agentic loop involving documentation consultation, implementation, simulation running, and debugging.
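The loop in the last point can be sketched roughly as follows. This is a sketch under stated assumptions, not the benchmark's actual harness: `propose` stands in for a model call and `run` for a simulation run, and both are hypothetical callables supplied by the caller. The stubs at the bottom exist only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    datafile: str   # datafile text the model produced this round
    log: str        # simulator output or error message
    passed: bool    # whether the run met the task's check


def agentic_loop(
    docs: str,
    propose: Callable[[str, str], str],      # hypothetical: (docs, feedback) -> datafile
    run: Callable[[str], Tuple[bool, str]],  # hypothetical: datafile -> (passed, log)
    max_rounds: int = 5,
) -> List[Attempt]:
    """Consult docs, implement, run the simulation, and debug from the log."""
    history: List[Attempt] = []
    feedback = ""
    for _ in range(max_rounds):
        datafile = propose(docs, feedback)   # implementation step
        passed, log = run(datafile)          # simulation step
        history.append(Attempt(datafile, log, passed))
        if passed:
            break
        feedback = log                       # debugging input for the next round
    return history


# Stubs: the fake model omits the bodies section until it sees an error log.
def fake_propose(docs: str, feedback: str) -> str:
    base = "vertices\nedges\nfaces\n"
    return base + "bodies\n" if "bodies" in feedback else base


def fake_run(datafile: str) -> Tuple[bool, str]:
    if "bodies" in datafile:
        return True, "ok"
    return False, "error: no bodies section"


history = agentic_loop("(documentation text)", fake_propose, fake_run)
```

Capping the number of rounds keeps a model that never converges from looping indefinitely; the cap of 5 here is arbitrary.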

This evaluation highlights the capability of current models to handle intricate, domain-specific coding tasks that require iterative debugging and adherence to complex specifications.