FlashMorph: Budget-Constrained Hybrid Layer Selection for Efficient Transformers

FlashMorph converts Transformer models into hybrid architectures that trade off full-attention accuracy against linear-attention efficiency, treating the choice of which layers keep full attention as a budget-constrained subset selection problem. It constructs a morphable model with parallel full-attention and linear-attention branches, then jointly optimizes layerwise gates on synthetic data to determine which layers keep full attention.
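
One way to write this selection problem down is sketched here. The notation is an assumption and does not come from the source: $L$ is the number of attention layers, $B$ the preset full-attention budget, $S$ the set of layers that keep full attention, $g_\ell \in [0,1]$ a relaxed gate for layer $\ell$, $\mathcal{L}_{\text{task}}$ an unspecified loss on the synthetic data, and $\lambda$ a regularization weight. The sum-of-gates regularizer is likewise only one plausible form.

```latex
% Discrete problem: choose at most B layers to keep full attention.
\min_{S \subseteq \{1,\dots,L\},\ |S| \le B} \ \mathcal{L}_{\text{task}}(S)

% Assumed continuous relaxation: optimize gates with the model weights frozen.
\min_{g \in [0,1]^{L}} \ \mathcal{L}_{\text{task}}(g) + \lambda \sum_{\ell=1}^{L} g_\ell
```

Under this reading, a natural discretization rule would keep full attention in the $B$ layers with the largest learned gates and switch the remaining layers to linear attention.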

FlashMorph formulates hybrid layer selection as a budget-constrained subset optimization problem rather than relying on heuristic strategies.
It equips each full-attention layer with a converted linear-attention branch, then optimizes only the layerwise gates while the model weights stay frozen.
During gate optimization, a linearization regularizer pushes the gates toward the linear-attention branch for efficiency.
Learned gates are discretized under a preset full-attention budget to instantiate the final hybrid architecture.
After instantiation, the hybrid model is trained with standard logit distillation and long-context finetuning.
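
The gate-related steps above can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the sigmoid-parameterized scalar gate, the convex mixing of the two branch outputs, the sum-of-gates penalty, the top-k discretization rule, and all function names (`mix_branches`, `linearization_penalty`, `discretize`) are hypothetical.

```python
import math


def sigmoid(x):
    """Map an unconstrained gate logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mix_branches(full_out, linear_out, gate_logit):
    """Blend one layer's two branch outputs with a scalar gate.

    Assumed form: gate * full_attention + (1 - gate) * linear_attention.
    """
    g = sigmoid(gate_logit)
    return [g * f + (1.0 - g) * l for f, l in zip(full_out, linear_out)]


def linearization_penalty(gate_logits, strength):
    """Assumed regularizer: penalize total reliance on full attention."""
    return strength * sum(sigmoid(z) for z in gate_logits)


def discretize(gate_logits, budget):
    """Keep full attention in the `budget` layers with the largest gates."""
    if not 0 <= budget <= len(gate_logits):
        raise ValueError("budget must be between 0 and the number of layers")
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    keep = set(ranked[:budget])
    return ["full" if i in keep else "linear" for i in range(len(gate_logits))]


# Hypothetical learned gate logits for a six-layer model, budget of two.
layout = discretize([2.1, -0.7, 0.3, -1.5, 1.2, -0.2], budget=2)
print(layout)
```

With these made-up logits, layers 0 and 4 keep full attention and the other four become linear-attention layers. In a real setting the gate logits would be the only trainable parameters during the search, and the penalty would be added to whatever loss is computed on the synthetic data.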

Compared with existing layer-selection methods, FlashMorph finds more effective hybrid configurations that preserve strong long-context recall and general benchmark performance, while substantially reducing the cost of layer selection.