Researchers propose CausalMix, a method that casts data mixture optimization for large language models as a causal inference problem, addressing the static distribution assumptions that limit existing methods. The approach treats statistical features as covariates and domain mixtures as treatments, and estimates the Conditional Average Treatment Effect (CATE) from 512 runs of Qwen2.5-0.5B to extrapolate optimal mixtures for larger models.

  • CausalMix dynamically infers state-dependent optimal data mixtures, using causal modeling to isolate confounding biases.
  • The framework successfully generalizes to long chain-of-thought data on Qwen3-4B-Base.
  • Experiments show the mixture guided by CausalMix consistently improves performance across multiple downstream tasks, outperforming RegMix and other baselines.
  • A CATE Interpreter is provided for visual analysis of the learned mixing strategy.

CausalMix offers a causal and interpretable framework for optimizing LLM data mixtures, allowing seamless scaling from small settings to larger data pools without costly retraining.
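
The covariate/treatment formulation described above can be illustrated with a minimal sketch. This is not the CausalMix estimator: it assumes a simple ridge-regularized linear outcome model with covariate-by-mixture interaction terms (an S-learner style stand-in), synthetic data in place of the 512 proxy runs, and an outcome where lower is better (for example a validation loss). All function names (`featurize`, `fit_outcome_model`, `cate`, `best_mixture`) are hypothetical.

```python
import numpy as np

def featurize(x, t):
    """Design matrix: bias, covariates, mixture weights, and their interactions."""
    x = np.atleast_2d(x)
    t = np.atleast_2d(t)
    inter = np.einsum("ni,nj->nij", x, t).reshape(len(x), -1)
    return np.hstack([np.ones((len(x), 1)), x, t, inter])

def fit_outcome_model(x, t, y, ridge=1e-3):
    """Ridge regression of the outcome on covariates and treatment (assumed model)."""
    phi = featurize(x, t)
    gram = phi.T @ phi + ridge * np.eye(phi.shape[1])
    return np.linalg.solve(gram, phi.T @ y)

def cate(w, x, t_new, t_base):
    """Estimated effect of switching from t_base to t_new, conditional on x."""
    return (featurize(x, t_new) - featurize(x, t_base)) @ w

def best_mixture(w, x, candidates, t_base):
    """Pick the candidate mixture with the lowest estimated effect on the outcome."""
    effects = np.array([cate(w, x, c, t_base)[0] for c in candidates])
    return candidates[int(np.argmin(effects))], effects

# Synthetic stand-in for the proxy runs: 2 covariates, 3 domains (assumed sizes).
rng = np.random.default_rng(0)
x_runs = rng.uniform(0.0, 1.0, size=(512, 2))
t_runs = rng.dirichlet(np.ones(3), size=512)  # each row is a mixture on the simplex
# Assumed ground truth: domain 0 helps when covariate 0 is high, domain 1 when covariate 1 is high.
y_runs = 2.0 - x_runs[:, 0] * t_runs[:, 0] - x_runs[:, 1] * t_runs[:, 1]

w = fit_outcome_model(x_runs, t_runs, y_runs)
candidates = np.eye(3)            # pure single-domain mixtures as candidates
uniform = np.full(3, 1.0 / 3.0)   # baseline mixture

for state in (np.array([1.0, 0.1]), np.array([0.1, 1.0])):
    mix, effects = best_mixture(w, state, candidates, uniform)
    print(state, "->", mix, effects.round(3))
```

In this toy setup the preferred mixture changes with the covariates, which is the sense in which a CATE-based choice is state-dependent rather than a single static mixture.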