Smooth Scaling Laws Hide Stepwise Token Learning

This study presents a token-level framework that decomposes language model scaling laws into localized learning events of individual contextualized tokens, challenging the view that heavy-tailed pattern difficulty is the sole cause of their smooth shape.

The authors fit each token's loss trajectory with a sigmoid and show that learning is concentrated in localized transitions; the distribution of these transition times forms a learning-time spectrum that dominates the shape of the scaling law.
Across more than 100 pre-training runs on large corpora with models up to 6B parameters and 300B tokens, this spectrum quantitatively reconstructs the validation loss derivative along training-step, data-scale, and model-scale axes.
Reshaping the training distribution according to when tokens become learnable alters the optimization trajectory and reduces validation loss 11% faster.
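
To make the sigmoid fits and the learning-time spectrum concrete, here is a minimal Python sketch on synthetic data. It is not the authors' code. The parametrization (a falling sigmoid in log10 of the training step, with a midpoint, width, drop, and floor), the grid-search fitting procedure, and all names such as `fit_sigmoid` and `predicted_slope` are assumptions made for illustration. The sketch fits three hypothetical noise-free token trajectories, collects the fitted midpoints as a toy spectrum, and checks that the fitted sigmoids predict the slope of the mean loss.

```python
import math

def falling_sigmoid(x, mid, width):
    """Smooth step from 1 (x well below mid) to 0 (x well above mid)."""
    z = max(-60.0, min(60.0, (x - mid) / width))  # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(z))

def fit_sigmoid(xs, ys, mids, widths):
    """Grid-search mid and width; solve drop and floor by linear least squares."""
    n = len(xs)
    y_mean = sum(ys) / n
    best = None
    for mid in mids:
        for width in widths:
            g = [falling_sigmoid(x, mid, width) for x in xs]
            g_mean = sum(g) / n
            var = sum((gi - g_mean) ** 2 for gi in g)
            if var < 1e-12:
                continue  # sigmoid is flat on the sampled range; nothing to fit
            cov = sum((gi - g_mean) * (yi - y_mean) for gi, yi in zip(g, ys))
            drop = cov / var
            floor = y_mean - drop * g_mean
            sse = sum((floor + drop * gi - yi) ** 2 for gi, yi in zip(g, ys))
            if best is None or sse < best["sse"]:
                best = {"mid": mid, "width": width, "drop": drop,
                        "floor": floor, "sse": sse}
    return best

def token_loss(p, x):
    """Loss of one token at log10(step) = x under the assumed sigmoid form."""
    return p["floor"] + p["drop"] * falling_sigmoid(x, p["mid"], p["width"])

# Hypothetical tokens that are learned early, mid-way, and late in training.
true_tokens = [
    {"mid": 2.0, "width": 0.2, "drop": 4.0, "floor": 1.0},
    {"mid": 3.0, "width": 0.3, "drop": 5.0, "floor": 0.5},
    {"mid": 4.5, "width": 0.2, "drop": 3.0, "floor": 2.0},
]
xs = [i * 0.1 for i in range(61)]  # log10(step) from 0 to 6
trajectories = [[token_loss(p, x) for x in xs] for p in true_tokens]

# Fit every trajectory; the fitted midpoints form the toy learning-time spectrum.
mids = [i * 0.1 for i in range(61)]
widths = [0.1 * k for k in range(1, 11)]
fits = [fit_sigmoid(xs, ys, mids, widths) for ys in trajectories]
spectrum = [f["mid"] for f in fits]

def predicted_slope(fits, x):
    """Slope of the mean loss w.r.t. log10(step), summed from the fitted sigmoids."""
    total = 0.0
    for f in fits:
        g = falling_sigmoid(x, f["mid"], f["width"])
        total += -f["drop"] * g * (1.0 - g) / f["width"]
    return total / len(fits)

def mean_loss(x):
    """Mean loss over the synthetic tokens, standing in for validation loss."""
    return sum(token_loss(p, x) for p in true_tokens) / len(true_tokens)

# Compare the spectrum-based slope with a finite-difference slope at one point.
x0, h = 3.0, 1e-4
finite_diff_slope = (mean_loss(x0 + h) - mean_loss(x0 - h)) / (2 * h)
spectrum_slope = predicted_slope(fits, x0)
print("spectrum:", spectrum)
print("slopes:", finite_diff_slope, spectrum_slope)
```

The grid search keeps the sketch dependency-free and deterministic; with real, noisy trajectories a nonlinear least-squares routine such as `scipy.optimize.curve_fit` would likely be a better choice. Each token contributes to the slope only near its own midpoint, which is why the distribution of midpoints shapes the aggregate curve in this toy setting.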

These results provide direct empirical evidence that scaling laws are governed by the distribution of token-level learning times, and that this distribution can be used both to explain scaling behavior and to improve training performance.