Researchers propose the state-prediction separation hypothesis, arguing that disentangling next-token prediction from state storage yields better language modeling performance. They design a Transformer variant that uses two computation streams to separate these functions and run pretraining experiments at several scales.

  • The proposed architecture is consistently more data- and compute-efficient than standard Transformers.
  • It improves validation loss during pretraining.
  • It outperforms standard Transformers by 2 to 3 percentage points on average on downstream tasks.
  • Empirical analysis rules out confounders and demonstrates fundamental differences in the gradients entailed by this design.

The authors consider this significant because it offers a way to improve model performance by architecturally separating computational roles.
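
To make the idea of two computation streams concrete, the toy sketch below keeps a state stream and a prediction stream side by side. It is an illustration under stated assumptions, not the authors' architecture: the summary does not specify how the streams interact, so the one-way read from state to prediction, the causal running mean used as a stand-in for attention, and every name here (`two_stream_layer`, `causal_mix`, `W_state`, `W_pred`) are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def causal_mix(states):
    """Position t receives the mean of states[0..t].

    This is a simple causal stand-in for attention, chosen only to keep
    the example short (an assumption, not the paper's mechanism).
    """
    out = []
    running = [0.0] * len(states[0])
    for t, s in enumerate(states):
        running = add(running, s)
        out.append([r / (t + 1) for r in running])
    return out


def two_stream_layer(state, pred, W_state, W_pred):
    """One hypothetical two-stream layer.

    state stream: residual update computed from the state stream alone.
    pred stream:  residual update that reads the updated state stream;
                  nothing is ever written back from pred to state.
    """
    mixed = causal_mix(state)
    new_state = [add(s, matvec(W_state, m)) for s, m in zip(state, mixed)]
    new_pred = [add(p, matvec(W_pred, s)) for p, s in zip(pred, new_state)]
    return new_state, new_pred


# Three tokens with 2-dimensional embeddings; both streams start from them.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_state = [[0.5, 0.0], [0.0, 0.5]]
W_pred = [[1.0, 0.0], [0.0, 1.0]]
state_out, pred_out = two_stream_layer(emb, emb, W_state, W_pred)
```

In this sketch, changing `W_pred` alters only `pred_out` and leaves `state_out` untouched, which is one simple way to express the separation of roles. Whether the proposed architecture enforces the separation in this manner is not stated in the summary.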