FlashMorph is a novel method for converting Transformer models into hybrid architectures that balance full-attention accuracy with linear-attention efficiency by formulating layer selection as a budget-constrained subset selection problem. The approach constructs a morphable model with parallel full-attention and linear-attention branches and jointly optimizes layerwise gates on synthetic data to determine which layers should retain full attention.
- FlashMorph formulates hybrid layer selection as a budget-constrained subset optimization problem rather than relying on heuristic strategies.
- It equips each full-attention layer with a converted linear-attention branch and freezes model weights while optimizing layerwise gates.
- A linearization regularizer applied during gate optimization pushes the gates toward the linear-attention branch, favoring efficiency.
- Learned gates are discretized under a preset full-attention budget to instantiate the final hybrid architecture.
- The method employs standard logits distillation and long-context finetuning after instantiation.
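
The gating and discretization steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the sigmoid gate parameterization, the mean-gate form of the penalty, the top-k discretization rule, and all names (`mix_branches`, `linearization_penalty`, `discretize_gates`, `weight`) are assumptions made for illustration only.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mix_branches(full_out: List[float], linear_out: List[float],
                 gate_logit: float) -> List[float]:
    """Blend the two branch outputs of one layer: g * full + (1 - g) * linear.

    Assumption: a single scalar sigmoid gate per layer.
    """
    g = sigmoid(gate_logit)
    return [g * f + (1.0 - g) * l for f, l in zip(full_out, linear_out)]


def linearization_penalty(gate_logits: List[float], weight: float = 0.1) -> float:
    """Hypothetical regularizer: penalize the mean gate mass on full attention,
    so that minimizing it pushes gates toward the linear branch."""
    return weight * sum(sigmoid(z) for z in gate_logits) / len(gate_logits)


def discretize_gates(gate_logits: List[float], budget: int) -> List[str]:
    """Keep full attention in the `budget` layers with the largest gates
    (assumed top-k rule); every other layer becomes linear attention."""
    if not 0 <= budget <= len(gate_logits):
        raise ValueError("budget must be between 0 and the number of layers")
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    keep = set(ranked[:budget])
    return ["full" if i in keep else "linear" for i in range(len(gate_logits))]


# Made-up gate logits for a 6-layer model with a budget of 2 full-attention layers.
layout = discretize_gates([2.1, -0.7, 0.3, -1.5, 1.2, -0.2], budget=2)
print(layout)  # ['full', 'linear', 'linear', 'linear', 'full', 'linear']
```

In this sketch only the gate logits would be trained while the model weights stay frozen, and the budget is applied once, after optimization, to produce a fixed layer layout.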
FlashMorph discovers more effective hybrid configurations that preserve strong long-context recall and general benchmark performance while substantially reducing layer selection costs compared to existing methods.