The paper introduces SQLConductor, a step-wise orchestration learning framework for Text-to-SQL that formulates subtasks as specialized actions and trains a policy model to select the next action based on intermediate artifacts and feedback.
- Uses Search-to-Policy Learning, in which Monte Carlo Tree Search explores candidate workflows and stability estimation identifies robust supervision.
- Trains the policy model using Stability-weighted Supervised Fine-tuning to prioritize high-quality orchestration patterns.
- Enhances the policy through Curriculum Reinforcement Learning to transform offline workflow search into a deployable inference-time policy.
- Achieves 73.2% execution accuracy (EX) on BIRD-Dev using a compact orchestration policy that coordinates larger frozen action models, outperforming prior methods.
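To make the idea of stability-weighted supervision concrete, here is a minimal sketch. This summary does not give the paper's stability estimator or loss, so both are assumptions: `stability` is taken to be the fraction of repeated rollouts of a workflow that execute correctly, and `weighted_nll` is an ordinary weighted average of per-example negative log-likelihoods. Both function names are hypothetical.

```python
from typing import Sequence


def stability(outcomes: Sequence[bool]) -> float:
    """Assumed estimator: fraction of repeated rollouts that executed correctly."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def weighted_nll(
    token_logprobs: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> float:
    """Weighted mean of per-example, length-normalized negative log-likelihoods.

    token_logprobs[i] holds the log-probabilities the policy assigns to the
    target tokens of workflow i; weights[i] is that workflow's stability score.
    """
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    loss = 0.0
    for logprobs, weight in zip(token_logprobs, weights):
        nll = -sum(logprobs) / len(logprobs)  # average NLL per token
        loss += weight * nll
    return loss / total_weight


# Two candidate workflows: one stable across rollouts, one flaky.
w_stable = stability([True, True, True, False])   # 0.75
w_flaky = stability([True, False, False, False])  # 0.25
loss = weighted_nll([[-1.0], [-3.0]], [w_stable, w_flaky])
```

Under this assumed weighting, the stable workflow contributes three times as much to the loss as the flaky one, which is one simple way to prioritize high-quality orchestration patterns.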
The approach adapts its orchestration to diverse query demands and demonstrates higher execution accuracy and stronger generalization than existing systems.
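The step-wise orchestration described above can be pictured as a loop in which a policy inspects the current artifacts and feedback and picks the next action. The sketch below is illustrative only: the action names (`schema_link`, `generate_sql`, `execute`), the `State` fields, and the rule-based `toy_policy` are hypothetical stand-ins, not the paper's action set or its learned policy model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class State:
    question: str
    artifacts: Dict[str, str] = field(default_factory=dict)  # intermediate outputs
    feedback: Optional[str] = None                            # e.g. execution result
    history: List[str] = field(default_factory=list)          # actions taken so far


# Hypothetical actions; in the described setup these would be frozen larger models.
def schema_link(state: State) -> None:
    state.artifacts["schema"] = "users(id, name)"


def generate_sql(state: State) -> None:
    state.artifacts["sql"] = "SELECT COUNT(*) FROM users"


def execute(state: State) -> None:
    state.feedback = "ok" if "sql" in state.artifacts else "error: no sql"


ACTIONS: Dict[str, Callable[[State], None]] = {
    "schema_link": schema_link,
    "generate_sql": generate_sql,
    "execute": execute,
}


def toy_policy(state: State) -> str:
    """Rule-based stand-in for the trained policy: choose the next action name."""
    if "schema" not in state.artifacts:
        return "schema_link"
    if "sql" not in state.artifacts:
        return "generate_sql"
    if state.feedback != "ok":
        return "execute"
    return "stop"


def orchestrate(question: str, policy: Callable[[State], str], max_steps: int = 8) -> State:
    """Run actions one at a time until the policy stops or the step budget ends."""
    state = State(question)
    for _ in range(max_steps):
        name = policy(state)
        if name == "stop":
            break
        ACTIONS[name](state)
        state.history.append(name)
    return state


result = orchestrate("How many users are there?", toy_policy)
```

Because the policy is passed in as a plain callable, a learned model could replace `toy_policy` without changing the loop; the `max_steps` budget is an assumed safeguard against non-terminating workflows.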