The paper introduces ScaleToT, a method that learns structured reasoning from a small subset of users and extends it to billions of low-activity users with sparse profiles. It combines a bounded entropy-guided Tree-of-Thought refinement with supervised fine-tuning and reward policy optimization, so that reasoning capabilities transfer without running LLM inference over the full population.

  • Constructs typed user-state chains using a bounded entropy-guided Tree-of-Thought (ToT) refinement procedure.
  • Trains a student model on static profiles via supervised fine-tuning (SFT) and Outcome-Driven Segment-Aware Implicit Reward Policy Optimization (OSIPO).
  • Transfers reasoning representations to a lightweight profile encoder to provide shared signals for the remaining users.
  • Is evaluated on lifetime value (LTV) prediction in a billion-scale advertising deployment, where offline reasoning covers only 7.32% of the population.
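
The summary does not spell out how the bounded entropy-guided refinement works, so the following is only a minimal sketch of one plausible reading: a node in the reasoning tree is expanded only when the entropy of its candidate-state distribution is high, and expansion stops at a fixed depth and expansion budget. The names `Node`, `refine`, `toy_propose`, and the parameters `entropy_threshold`, `max_depth`, and `max_expansions` are all hypothetical and not taken from the paper; `toy_propose` stands in for whatever LLM call would generate child thoughts.

```python
import math
from dataclasses import dataclass, field


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


@dataclass
class Node:
    label: str
    probs: list  # hypothetical distribution over candidate user states
    depth: int = 0
    children: list = field(default_factory=list)


def refine(root, propose, entropy_threshold=0.5, max_depth=2, max_expansions=4):
    """Expand the most uncertain nodes first, under hard depth and count budgets.

    Returns the number of expansions performed. This is an illustrative
    assumption about what "bounded entropy-guided" could mean, not the
    authors' procedure.
    """
    expansions = 0
    frontier = [root]
    while frontier and expansions < max_expansions:
        # Visit the highest-entropy (most uncertain) node next.
        frontier.sort(key=lambda n: shannon_entropy(n.probs), reverse=True)
        node = frontier.pop(0)
        too_deep = node.depth >= max_depth
        confident = shannon_entropy(node.probs) < entropy_threshold
        if too_deep or confident:
            continue  # leave this node as a leaf
        node.children = propose(node)
        expansions += 1
        frontier.extend(node.children)
    return expansions


def toy_propose(node):
    """Hypothetical stand-in for an LLM call that proposes child thoughts."""
    return [
        Node(f"{node.label}/a", [0.9, 0.1], node.depth + 1),  # confident child
        Node(f"{node.label}/b", [0.5, 0.5], node.depth + 1),  # uncertain child
    ]


if __name__ == "__main__":
    root = Node("root", [0.5, 0.5])
    print(refine(root, toy_propose))
```

In this toy setup only uncertain nodes above the depth limit are expanded, so the number of proposal calls stays bounded regardless of how the tree could otherwise grow.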

ScaleToT increases LT30 by 6.738% in online A/B tests while significantly reducing compute costs compared to full-population reasoning.