Rebuilding Gemma 4 31B... better... as 26B...

A developer outlines a plan to rebuild the Gemma 4 31B model by reducing its parameter count to approximately 26B while aiming for improved performance. The project involves architectural changes, specific training techniques, and dataset curation to create a smaller, more efficient model.

Remove Layer 3, identified as the weakest of the five sliding window attention (SWA) layers.
Rescale the SWA spans to 1024/2048/4096/8.1k tokens, followed by a global layer.
Implement "Attention based Residual Networks" in global layers to improve information flow and global coherence.
Use the original model's top-K logits (K = 12 or 20) as targets for retraining, while freezing the top and bottom of the network.
Reduce total parameters from ~30.81B to ~26.02B through these structural modifications.
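
The top-K logit targets in the list above can be illustrated with a small sketch. The post does not say which loss is used, so this assumes a plain cross-entropy between the student's full-vocabulary distribution and the teacher's top-K logits renormalised into a probability distribution. The function names (`topk_targets`, `topk_distill_loss`) are hypothetical, and the layer freezing mentioned in the plan is not shown.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def topk_targets(teacher_logits, k):
    """Keep the k largest teacher logits and renormalise them.

    Returns a dict mapping token index -> target probability.
    Assumption: probability mass outside the top k is simply dropped.
    """
    order = sorted(range(len(teacher_logits)),
                   key=lambda i: teacher_logits[i], reverse=True)
    top = order[:k]
    logp = log_softmax([teacher_logits[i] for i in top])
    return {i: math.exp(lp) for i, lp in zip(top, logp)}


def topk_distill_loss(student_logits, targets):
    """Cross-entropy of the student distribution against sparse top-k targets."""
    student_logp = log_softmax(student_logits)
    return -sum(p * student_logp[i] for i, p in targets.items())


# Toy example with a 5-token vocabulary and k = 2 (the post uses 12 or 20).
teacher = [2.0, 0.5, -1.0, 3.0, 0.0]
targets = topk_targets(teacher, k=2)
print(targets)
print(topk_distill_loss(teacher, targets))
```

Storing only K logits per position keeps the target set small compared with a full vocabulary, which is presumably why a value such as 12 or 20 is attractive; the trade-off is that the tail of the teacher distribution is discarded.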

The author aims for better long-context capability and overall performance in a smaller footprint, and may also uncensor the model's "thinking" training phase.
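
To put the smaller footprint in numbers, the arithmetic on the quoted parameter counts can be sketched as follows. The two counts come from the post and are approximate; the 2 bytes per parameter (bf16 weights) and the 1 GB = 1e9 bytes convention are assumptions made here for illustration, and the helper names are hypothetical.

```python
ORIGINAL_PARAMS_B = 30.81  # billions of parameters, approximate, from the post
REBUILT_PARAMS_B = 26.02   # billions of parameters, approximate, from the post


def reduction(original_b, rebuilt_b):
    """Return (parameters removed in billions, fraction of the original removed)."""
    removed = original_b - rebuilt_b
    return removed, removed / original_b


def weight_gb(params_b, bytes_per_param=2):
    """Approximate weight storage in GB.

    Assumption: bf16 weights (2 bytes each) and 1 GB = 1e9 bytes, so
    billions of parameters times bytes per parameter gives GB directly.
    """
    return params_b * bytes_per_param


removed, fraction = reduction(ORIGINAL_PARAMS_B, REBUILT_PARAMS_B)
print(f"removed: {removed:.2f}B parameters ({fraction:.1%} of the original)")
print(f"weights: {weight_gb(ORIGINAL_PARAMS_B):.2f} GB -> "
      f"{weight_gb(REBUILT_PARAMS_B):.2f} GB at 2 bytes per parameter")
```

This covers weights only; activation and KV-cache memory depend on the attention spans and context length and are not estimated here.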