SQLConductor: Search-to-Policy Learning for Step-wise Text-to-SQL Orchestration

The authors propose SQLConductor, a step-wise orchestration learning framework for Text-to-SQL that addresses the limitations of fixed pipelines and static plan-then-execute methods. The framework formulates subtasks as specialized actions and trains a policy model to select the next action based on intermediate artifacts and feedback. To learn this policy, it introduces Search-to-Policy Learning, which uses Monte Carlo Tree Search to explore candidate workflows and stability estimation to identify robust supervision. The policy model is first trained with Stability-weighted Supervised Fine-tuning, which prioritizes high-quality orchestration patterns, and is then further improved through Curriculum Reinforcement Learning. This approach turns offline workflow search into a deployable policy for step-wise orchestration at inference time. In experiments on BIRD-Dev and out-of-distribution datasets, SQLConductor achieves 73.2% execution accuracy, outperforming prior methods with comparable or larger backbones. The results also indicate strong generalization, with the policy coordinating larger, frozen action models.
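
To make the idea of step-wise orchestration concrete, the following is a minimal illustrative sketch, not the authors' implementation. It shows a loop in which a policy inspects the current intermediate artifacts and feedback and picks the next action. Everything named here is an assumption introduced for illustration: the `State` fields, the four actions (`link_schema`, `generate_sql`, `execute_sql`, `revise_sql`), the hard-coded SQL strings, and `toy_policy`, a hand-written rule set standing in for the trained policy model. The search, stability estimation, and training stages are not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class State:
    """Intermediate artifacts and feedback visible to the policy (hypothetical fields)."""
    question: str
    schema_links: Optional[List[str]] = None
    sql: Optional[str] = None
    feedback: Optional[str] = None  # None means the current SQL has not been executed yet
    history: List[str] = field(default_factory=list)


# Hypothetical actions. In the described system these would be specialized
# (frozen) action models; here they are stubs that update the state in place.
def link_schema(s: State) -> None:
    s.schema_links = ["users.name", "users.age"]


def generate_sql(s: State) -> None:
    # Deliberately wrong table name so the revise step has something to fix.
    s.sql = "SELECT name FROM user WHERE age > 30"
    s.feedback = None


def execute_sql(s: State) -> None:
    # Stand-in for a database call: reports an error for the wrong table name.
    s.feedback = "no such table: user" if " user " in f" {s.sql} " else "ok"


def revise_sql(s: State) -> None:
    s.sql = s.sql.replace("FROM user ", "FROM users ")
    s.feedback = None  # the revised query still needs to be executed


ACTIONS: Dict[str, Callable[[State], None]] = {
    "link_schema": link_schema,
    "generate_sql": generate_sql,
    "execute_sql": execute_sql,
    "revise_sql": revise_sql,
}


def toy_policy(s: State) -> str:
    """Rule-based placeholder for a learned policy: state in, next action name out."""
    if s.schema_links is None:
        return "link_schema"
    if s.sql is None:
        return "generate_sql"
    if s.feedback is None:
        return "execute_sql"
    if s.feedback != "ok":
        return "revise_sql"
    return "stop"


def orchestrate(question: str,
                policy: Callable[[State], str] = toy_policy,
                max_steps: int = 8) -> State:
    """Run the step-wise loop: ask the policy for an action, apply it, repeat."""
    state = State(question)
    for _ in range(max_steps):
        action = policy(state)
        if action == "stop":
            break
        ACTIONS[action](state)
        state.history.append(action)
    return state


if __name__ == "__main__":
    final = orchestrate("List the names of users older than 30")
    print(final.history)
    print(final.sql)
```

The point of the sketch is the control flow: unlike a fixed pipeline, the sequence of actions is not determined in advance, and the `revise_sql` step is taken only because execution feedback reported an error. Swapping `toy_policy` for a learned model would leave the loop itself unchanged.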