The CARVE architecture addresses three critical defects in GDN-2, the leading delta-rule recurrent model. It restricts erase operations to the key axis, which makes the WY-form triangular chunk solve valid and improves value efficiency. It also reuses the recurrent output tensor as a content signal and replaces per-value write-gate projections with single scalars, which resolves memory-blind gating while keeping initialization bit-identical to GDN-2.
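
To make the key-axis erase and the scalar write gate concrete, the sketch below assumes a simplified update, S_t = S_{t-1} - (S_{t-1} k_t)(e_t * k_t)^T + w_t v_t k_t^T, where e_t is a per-key-channel erase gate, w_t is a scalar write gate, and * is the elementwise product. This form and every function name here are illustrative assumptions: the summary does not give CARVE's actual equations, and decay, normalization, and gate computation are omitted. Under this assumed form, each read-out in a chunk depends on earlier read-outs only through scalar key-key couplings, so the chunk can be solved by forward substitution on a lower-triangular system, in the spirit of the WY form. The code checks that the chunked solve reproduces the step-by-step recurrence.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(S, k):
    # S is a d_v x d_k matrix stored as a list of rows; returns S @ k.
    return [dot(row, k) for row in S]


def add_outer(S, u, k, scale=1.0):
    # In place: S += scale * outer(u, k).
    for i, ui in enumerate(u):
        for j, kj in enumerate(k):
            S[i][j] += scale * ui * kj


def run_sequential(S0, ks, vs, erases, writes):
    # Reference: apply the assumed update one token at a time.
    S = [row[:] for row in S0]
    for k, v, e, w in zip(ks, vs, erases, writes):
        r = matvec(S, k)                        # current read-out for key k
        ek = [ej * kj for ej, kj in zip(e, k)]  # erase gate acts on the key axis
        add_outer(S, r, ek, scale=-1.0)         # erase
        add_outer(S, v, k, scale=w)             # scalar-gated write
    return S


def run_chunk(S0, ks, vs, erases, writes):
    # Chunked form: solve for all read-outs by forward substitution,
    # then apply every rank-one update to S0 in one pass.
    T = len(ks)
    eks = [[ej * kj for ej, kj in zip(erases[t], ks[t])] for t in range(T)]
    reads = []
    for t in range(T):
        r = matvec(S0, ks[t])
        for i in range(t):                      # strictly lower-triangular part
            a = dot(eks[i], ks[t])              # coupling through earlier erases
            b = writes[i] * dot(ks[i], ks[t])   # coupling through earlier writes
            r = [rj - a * ri + b * vi for rj, ri, vi in zip(r, reads[i], vs[i])]
        reads.append(r)
    S = [row[:] for row in S0]
    for t in range(T):
        add_outer(S, reads[t], eks[t], scale=-1.0)
        add_outer(S, vs[t], ks[t], scale=writes[t])
    return S


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def random_chunk(T, d_k, d_v, seed=0):
    # Hypothetical toy data: unit-norm keys, gates drawn uniformly from [0, 1].
    rng = random.Random(seed)
    ks = []
    for _ in range(T):
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        norm = dot(k, k) ** 0.5
        ks.append([x / norm for x in k])
    vs = [[rng.gauss(0.0, 1.0) for _ in range(d_v)] for _ in range(T)]
    erases = [[rng.random() for _ in range(d_k)] for _ in range(T)]
    writes = [rng.random() for _ in range(T)]
    S0 = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(d_v)]
    return S0, ks, vs, erases, writes


S0, ks, vs, erases, writes = random_chunk(T=5, d_k=4, d_v=3)
seq = run_sequential(S0, ks, vs, erases, writes)
chk = run_chunk(S0, ks, vs, erases, writes)
print("max |sequential - chunked| =", max_abs_diff(seq, chk))
```

The pure-Python loops are for readability only. A practical kernel would batch the triangular solve as matrix operations over the whole chunk.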

  • Achieves a WikiText perplexity of 15.72 with a 1.3B-parameter model trained on 100B tokens, outperforming GDN-2 by a 4.5-sigma margin.
  • Leads all recurrent baselines on nine common-sense reasoning benchmarks and sets state-of-the-art results on every RULER retrieval probe.
  • Reduces peak memory usage by 13% and parameter count by 19% with only 0.4% throughput overhead.
  • Supported by six formal theorems covering memory capacity, Lyapunov stability, gradient flow, expressivity separation, Pareto-optimal chunk size, and hybrid optimality.

Through these mathematically grounded architectural changes, recurrent models remain competitive with Transformers in training efficiency while improving significantly on retrieval and reasoning tasks.