A developer outlines a plan to rebuild the Gemma 4 31B model by reducing its parameter count to approximately 26B while aiming for improved performance. The project involves architectural changes, specific training techniques, and dataset curation to create a smaller, more efficient model.
- Remove Layer 3, identified as the weakest of the five sliding window attention (SWA) layers.
- Rescale the SWA attention spans to 1024/2048/4096/8.1k tokens across the four remaining SWA layers, followed by a global layer.
- Implement "Attention based Residual Networks" in global layers to improve information flow and global coherence.
- Retrain using the original model's top-K logits (K = 12 or 20) as targets, while freezing the top and bottom of the network.
- Reduce total parameters from ~30.81B to ~26.02B through these structural modifications.
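
The top-K retraining step in the list above can be illustrated with a small, dependency-free sketch. It is not the author's code: the function names (`top_k`, `topk_distill_loss`, `trainable_layers`), the choice of a KL divergence renormalised over the teacher's top-K token ids, and the frozen-layer counts in the usage note are all assumptions made for illustration.

```python
import math

def top_k(logits, k):
    """Return (indices, values) of the k largest logits, largest first."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return idx, [logits[i] for i in idx]

def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - lse for v in values]

def topk_distill_loss(teacher_logits, student_logits, k=12):
    """KL(teacher || student), with both distributions renormalised over
    the teacher's top-k token ids (an assumed form of the loss)."""
    idx, t_vals = top_k(teacher_logits, k)
    s_vals = [student_logits[i] for i in idx]
    t_logp = log_softmax(t_vals)
    s_logp = log_softmax(s_vals)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))

def trainable_layers(n_layers, frozen_bottom, frozen_top):
    """Indices of layers left unfrozen when the bottom and top are frozen.
    The counts passed in are hypothetical; the source does not give them."""
    return list(range(frozen_bottom, n_layers - frozen_top))
```

For example, `topk_distill_loss(t, t, k=12)` is zero for any logit vector `t` with at least one entry, and `trainable_layers(10, 2, 3)` leaves layers 2 through 6 trainable. Storing only 12 or 20 logits per position keeps the teacher targets far smaller than a full-vocabulary distribution, which is presumably the appeal of the approach.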
The author aims for better long-context capability and overall performance in a smaller footprint, and is considering uncensoring the model's "thinking" training phase.
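
As a rough check on the numbers and layer layout described in the list, the sketch below works out the size of the parameter cut and builds one repeating block of four SWA layers plus a global layer. It rests on stated assumptions: "8.1k" is read as 8192, the block structure is inferred from the summary, and the helper names (`block_pattern`, `sliding_window_mask`) are hypothetical.

```python
ORIGINAL_PARAMS_B = 30.81  # billions, from the summary
TARGET_PARAMS_B = 26.02    # billions, from the summary

removed_b = ORIGINAL_PARAMS_B - TARGET_PARAMS_B   # about 4.79B removed
removed_pct = 100 * removed_b / ORIGINAL_PARAMS_B  # about 15.5% of the original

# Assumption: "8.1k" in the summary is read here as 8192 tokens.
SWA_SPANS = [1024, 2048, 4096, 8192]

def block_pattern(spans):
    """One repeating block: an SWA layer per span, then a global layer."""
    return [("swa", s) for s in spans] + [("global", None)]

def sliding_window_mask(seq_len, span):
    """Causal sliding-window mask: position i may attend to position j
    when i - span < j <= i."""
    return [[(i - span < j <= i) for j in range(seq_len)]
            for i in range(seq_len)]
```

With a span of 2 and a sequence of 4 tokens, the last row of `sliding_window_mask(4, 2)` allows attention only to positions 2 and 3, which is the local behaviour the wider spans relax step by step before the global layer sees everything.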