A new LLM micro-benchmark evaluates how well large language models can simulate solid-liquid interfaces using Surface Evolver (SE), a tool for modeling liquid surfaces that dates to 1992. The benchmark requires an LLM to write SE datafiles that define the geometry and constraints, working through an iterative agentic process with objective grading. The task is niche, has real scientific relevance, and is sparsely represented in training data.
My new benchmark: how good are LLMs at simulating wetting behavior?