TxBench-PP is a verifiable benchmark for small-molecule preclinical pharmacology that tests AI agents' ability to derive accurate conclusions from real-world assay data. Across 16 model configurations, no system reliably passed all evaluations; the best-performing setup (Claude Opus 4.8 / Pi) achieved a 59.3% success rate on 300 endpoint attempts.
TxBench-PP: AI Agent Benchmark in Preclinical Pharmacology