NatureBench Evaluates AI Coding Agents' Scientific Discovery Capabilities

NatureBench is a benchmark of 90 tasks drawn from Nature-family papers that assesses whether AI coding agents can make scientific discoveries. With web search disabled, the top model exceeds the prior state of the art on only 17.8% of tasks. Agents succeed primarily by recasting scientific problems as supervised learning tasks rather than through original scientific invention.
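
As a rough illustration of how a figure like the 17.8% above could be computed, the sketch below counts the fraction of tasks on which an agent's score strictly beats a prior state-of-the-art score. The `TaskResult` type, the `surpass_rate` function, the metric-direction flag, and all scores are hypothetical and made up for this example; the benchmark's actual scoring procedure is not described here.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; not NatureBench's real schema."""
    name: str
    agent_score: float
    sota_score: float
    higher_is_better: bool = True  # False for error-style metrics


def beats_sota(result: TaskResult) -> bool:
    """True only if the agent strictly improves on the prior best score."""
    if result.higher_is_better:
        return result.agent_score > result.sota_score
    return result.agent_score < result.sota_score


def surpass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks on which the agent beats the prior state of the art."""
    if not results:
        raise ValueError("results must not be empty")
    wins = sum(beats_sota(r) for r in results)
    return wins / len(results)


# Made-up scores for illustration only.
results = [
    TaskResult("task_a", agent_score=0.91, sota_score=0.88),
    TaskResult("task_b", agent_score=0.40, sota_score=0.52),
    TaskResult("task_c", agent_score=1.7, sota_score=2.1, higher_is_better=False),
    TaskResult("task_d", agent_score=0.75, sota_score=0.75),  # a tie is not a win
]

rate = surpass_rate(results)
print(f"Surpass rate: {rate:.1%}")
```

If all 90 tasks are in the denominator (an assumption), 17.8% is consistent with 16 tasks, since 16/90 rounds to 17.8%.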