GeneBench-Pro is a research-level benchmark designed to measure how AI agents handle ambiguity and make consequential judgments in computational biology, expanding on the original GeneBench. It addresses a limitation of current assessments by testing higher-order capabilities: handling noisy data, revising assumptions, and judging when results are decision-ready.

  • The benchmark consists of 129 synthetically generated questions covering genomics, quantitative biology, and translational medicine. Because the questions are synthetic, answers can be graded deterministically against known causal structures.
  • Each problem provides a realistic dataset with technical issues, requiring agents to explore data, choose analytical approaches, and engage in iterative experimentation.
  • External domain experts reviewed the problems for realism and appropriateness, noting that they are challenging enough to require thoughtful analysis rather than simple application of off-the-shelf methods.
  • GPT-5.6 Sol achieved a pass rate of 28.7% at the highest reasoning level, rising to 31.5% with Pro mode enabled.
  • The results indicate that scaling test-time compute significantly improves performance. GPT-5.6 Sol also solved nearly six times as many questions as GPT-5.2 while using fewer tokens.
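
To put the reported pass rates in terms of questions, the short sketch below converts each percentage into an implied count out of the 129 questions. The function name `implied_solved` is hypothetical and used only for illustration. The implied counts need not be whole numbers, which may mean the reported rates are averaged over several runs; that is an assumption, not something the summary states.

```python
TOTAL_QUESTIONS = 129  # benchmark size reported in the summary


def implied_solved(pass_rate_percent: float, total: int = TOTAL_QUESTIONS) -> float:
    """Convert a pass rate in percent into an implied number of solved questions."""
    return pass_rate_percent / 100.0 * total


# Pass rates as reported for GPT-5.6 Sol in the summary.
reported = [("highest reasoning level", 28.7), ("Pro mode", 31.5)]

for label, rate in reported:
    count = implied_solved(rate)
    print(f"{label}: {rate}% of {TOTAL_QUESTIONS} is about {count:.1f} questions")
```

Under these figures, 28.7% corresponds to roughly 37 questions and 31.5% to roughly 41, so the Pro mode gain amounts to only a handful of additional questions solved.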

The benchmark highlights the growing gap between frontier models and open-source systems in high-level scientific reasoning under uncertainty. The results also suggest that AI assistance could improve the pace and reproducibility of biological research.