The authors propose Review Residuals, a mechanism that scales each sublayer's update by a learned, input-dependent gate conditioned on both the current state and the proposed update. The goal is to assess how reliable an update is before committing it, addressing a limitation of standard residual connections, which always add the update with a fixed coefficient of one.
- The gate is the sigmoid of learned weights applied to the RMSNorm-normalized previous hidden state and proposed update.
- A convex (Highway-style) gate form causes vanishing gradients beyond ~20 layers, while the additive form trains stably at all tested depths.
- Across five model sizes trained from scratch (60M to 1B parameters), Review Residuals show no advantage at the small scales.
- At 590M parameters, Review Residuals outperform both parameter-matched Highway gates and standard residuals by a statistically significant margin (p<0.05).
- The benefit grows with model size and is larger still at the 1B scale.
The authors consider this significant because it offers a stable, scalable improvement over standard residual connections, albeit one that emerges only at larger model scales.
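The difference between the additive and convex gate forms described above can be made concrete with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the summary does not say whether the gate is a scalar or per-channel, how the two normalized inputs are combined, or whether RMSNorm carries a learned gain. Here the gate is assumed to be a single scalar per position, computed from two hypothetical weight vectors `w_h` and `w_u` plus a bias `b`, and all function names are invented for illustration.

```python
import math

def rms_norm(x, eps=1e-6):
    """Rescale x to (approximately) unit root-mean-square; no learned gain (assumption)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def review_gate(h, u, w_h, w_u, b=0.0):
    """Scalar gate in (0, 1) from the normalized state h and proposed update u.

    w_h, w_u, and b are hypothetical learned parameters; a scalar gate is an assumption.
    """
    nh, nu = rms_norm(h), rms_norm(u)
    z = sum(a * c for a, c in zip(w_h, nh)) + sum(a * c for a, c in zip(w_u, nu)) + b
    return sigmoid(z)

def additive_step(h, u, g):
    """Additive form: the identity path is untouched, only the update is scaled."""
    return [hi + g * ui for hi, ui in zip(h, u)]

def convex_step(h, u, g):
    """Convex (Highway-style) form: the carried state is scaled by (1 - g)."""
    return [(1.0 - g) * hi + g * ui for hi, ui in zip(h, u)]

h = [1.0, -2.0, 0.5, 3.0]    # previous hidden state
u = [0.2, 0.1, -0.4, 0.3]    # proposed sublayer update
w_h = [0.0] * 4              # zero weights, so the gate is exactly 0.5
w_u = [0.0] * 4
g = review_gate(h, u, w_h, w_u)

print(additive_step(h, u, g))
print(convex_step(h, u, g))
```

One plausible reading of the depth result is that, in the convex form, the direct path from layer to layer is multiplied by `(1 - g)` at every sublayer, so with a gate near 0.5 the carried signal after 20 layers is scaled by roughly `0.5 ** 20`, which is below one in a million. The additive form leaves that path at a coefficient of one regardless of the gate value.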