Training methods
arXiv cs.LG · 8d ago

Recursive Masked Diffusion Models Introduce New Scaling Axis

Recursive Masked Diffusion Models (R-MDMs) introduce recursive depth as a third scaling axis by reapplying the same denoising transformer several times within each diffusion step. The recursion refines the output iteratively without increasing the parameter count, and R-MDMs achieve performance comparable to non-recursive models with up to L times more parameters, where L is the number of recursive iterations. R-MDMs also reduce inference compute by partially replacing denoising steps with recursive refinement.
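As a rough illustration of weight-shared recursion inside one step, the toy sketch below reapplies a single function with one fixed weight. Everything in it is an assumption made for illustration: `ToyDenoiser`, `recursive_denoise_step`, the `weight` and `target` values, and the list-of-floats state are hypothetical stand-ins, not the paper's architecture or its masking scheme.

```python
class ToyDenoiser:
    """Hypothetical stand-in for a denoising transformer with one shared weight."""

    def __init__(self, weight=0.5, target=1.0):
        self.weight = weight   # the only "parameter"; shared by every recursion
        self.target = target   # value the toy denoiser pulls estimates toward

    def __call__(self, estimate):
        # One pass: move each entry a fixed fraction of the way to the target.
        return [x + self.weight * (self.target - x) for x in estimate]


def recursive_denoise_step(denoiser, estimate, depth):
    """Reapply the same denoiser `depth` times within a single diffusion step."""
    for _ in range(depth):
        estimate = denoiser(estimate)
    return estimate


denoiser = ToyDenoiser()
shallow = recursive_denoise_step(denoiser, [0.0, 0.0], depth=1)
deep = recursive_denoise_step(denoiser, [0.0, 0.0], depth=3)
print(shallow, deep)
```

In this toy setting, three recursions leave a smaller residual than one while the denoiser still holds a single weight, which mirrors the idea of trading extra passes for refinement rather than adding parameters.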

arXiv cs.LG · 8d ago

Volterra Generative Models Introduce Fractional Noise for Score-Based Generation

Volterra generative models are a continuous-time score-based framework that uses fractional kernels to inject path-dependent noise, in contrast to the memoryless noising of traditional diffusion models. The approach introduces finite-dimensional Markovian lifts and proves squared-error bounds. It shows improved generation on MNIST and potential for natural images, and a bridge sampler improves stability for larger models.

arXiv cs.LG · 8d ago

Edge Flow: A Continuous-Time Model for Gradient Descent at Edge of Stability

Edge Flow is a tractable, predictive continuous-time model of gradient descent dynamics at the edge of stability (EoS). It decomposes the dynamics into a center, an oscillation direction, and a magnitude, with self-stabilization of sharpness emerging from the coupled feedback between them. The model requires only two gradient evaluations and one Hessian-vector product per iteration, and it outperforms prior models in tracking oscillations and explaining instabilities at EoS.
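For background on why oscillations appear at all, the sketch below runs plain gradient descent on a one-dimensional quadratic, where the iterates shrink when the learning rate times the sharpness is below 2 and flip sign with growing size above it. This is only the standard stability threshold on a toy problem, not the Edge Flow model; the function name `gd_on_quadratic` and all numeric values are illustrative assumptions.

```python
def gd_on_quadratic(sharpness, lr, x0=1.0, steps=5):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2; return all iterates."""
    xs = [x0]
    for _ in range(steps):
        grad = sharpness * xs[-1]        # f'(x) = sharpness * x
        xs.append(xs[-1] - lr * grad)    # each step multiplies x by (1 - lr * sharpness)
    return xs


stable = gd_on_quadratic(sharpness=1.0, lr=0.5)    # lr * sharpness = 0.5 < 2
unstable = gd_on_quadratic(sharpness=5.0, lr=0.5)  # lr * sharpness = 2.5 > 2
print(stable)
print(unstable)
```

On a real loss the sharpness changes during training, which is the regime the summary describes; the quadratic only shows the threshold itself.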

arXiv cs.LG · 8d ago

Compositional Generalization in Language Model Reasoning

A hierarchical latent selection model shows that supervised fine-tuning (SFT) and reinforcement learning (RL) work together to enable compositional generalization in language models. SFT supplies the raw modules, while RL identifies atomic modules in compound traces and recombines them to solve new problems. Training on compound traces leads to stronger generalization than training on isolated modules, and an effective protocol uses SFT to ensure module coverage and RL to drive exploration of novel compositions.