Autodata: An agentic data scientist to create high quality synthetic data

The authors introduce Autodata, a general method that enables AI agents to act as data scientists that build high-quality training and evaluation datasets. The approach meta-optimizes these agents so that they learn to generate increasingly stronger data through a process called Agentic Self-Instruct. Experiments cover computer science research tasks, legal reasoning, and mathematical object reasoning. The results show that agentic data creation improves performance over classical synthetic dataset creation techniques, and that meta-optimizing the data scientist agent itself yields an even larger gain. The work illustrates how additional inference compute can be converted into higher-quality training data. The authors suggest that this direction could fundamentally change how AI data is built.