Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

Researchers introduce the Holistic Data Scheduler (HDS), an online data mixing framework that adjusts the composition of training data dynamically and evaluates it along several dimensions at once, rather than from the single perspective used by existing methods. HDS formulates data scheduling as a reinforcement learning problem, solved with the Soft Actor-Critic algorithm and a multi-objective reward function.

HDS utilizes a multi-objective, holistic reward function integrating data-driven quality, loss-driven inter-domain influence, and model-driven weight norms.
The framework employs the Soft Actor-Critic (SAC) algorithm for stability and sample efficiency in exploring high-dimensional policy spaces.
On The Pile benchmark, HDS reaches the final validation perplexity of the next-best method using 44% fewer training iterations.
The resulting model shows a 7.2% improvement on zero-shot MMLU, along with consistent gains on other benchmarks.

This approach improves both training efficiency and final model capability by optimizing data mixtures with a reward that combines several perspectives, rather than optimizing from a single perspective.
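To make the idea of a multi-objective reward driving a data mixture more concrete, here is a minimal Python sketch. It is not the authors' implementation: the weighted-sum scalarization, the coefficient values, the softmax mapping from policy outputs to domain sampling probabilities, and every function and parameter name (`holistic_reward`, `mixture_from_logits`, `coeffs`) are assumptions made for illustration. The summary states only that the reward integrates a data-driven quality signal, a loss-driven inter-domain influence signal, and a model-driven weight-norm signal.

```python
import math


def holistic_reward(quality, influence, weight_norm, coeffs=(1.0, 1.0, 1.0)):
    """Combine three scalar signals into one reward.

    Assumption: a plain weighted sum. The source does not say how the
    three components are actually combined.
    """
    c_quality, c_influence, c_norm = coeffs
    return c_quality * quality + c_influence * influence + c_norm * weight_norm


def mixture_from_logits(logits):
    """Turn per-domain scores (e.g. a policy's action) into sampling weights.

    Assumption: a numerically stable softmax, so the weights are
    positive and sum to 1.
    """
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values for one scheduling step over three data domains.
action = [0.2, 1.0, -0.5]            # illustrative policy output
mixture = mixture_from_logits(action)
reward = holistic_reward(quality=0.5, influence=0.2, weight_norm=0.1)
print(mixture, reward)
```

In a full reinforcement learning loop, an agent such as SAC would observe the training state, emit the action, receive the reward after some training iterations, and update its policy; that loop is omitted here because the summary does not describe its details.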

Benchmarks