HDS introduces a multi-objective reinforcement learning framework for online data mixing in LLM pre-training. It requires 44% fewer training iterations on The Pile and improves zero-shot MMLU performance by 7.2%, with consistent gains on other benchmarks.
Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning