FlashMorph converts Transformer models into hybrid architectures that balance full-attention accuracy with linear-attention efficiency by treating layer selection as a budget-constrained subset selection problem. It constructs a morphable model with parallel attention branches and jointly optimizes layerwise gates on synthetic data to decide which layers retain full attention.

  • FlashMorph formulates hybrid layer selection as a budget-constrained subset optimization problem rather than relying on heuristic strategies.
  • It equips each full-attention layer with a converted linear-attention branch and freezes model weights while optimizing layerwise gates.
  • A linearization regularization encourages reliance on linear attention for efficiency during the gate optimization process.
  • Learned gates are discretized under a preset full-attention budget to instantiate the final hybrid architecture.
  • After instantiation, the hybrid model is trained with standard logits distillation and long-context finetuning.
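
The gate-related steps above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the method's actual implementation: the sigmoid gate parameterization, the convex blend of the two branch outputs, the mean-gate form of the linearization penalty, the top-k discretization rule, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    """Map a gate logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mix_branches(full_out, linear_out, gate_logit):
    """Blend one layer's two branch outputs (assumed form): g * full + (1 - g) * linear."""
    g = sigmoid(gate_logit)
    return [g * f + (1.0 - g) * l for f, l in zip(full_out, linear_out)]


def linearization_penalty(gate_logits, weight=1.0):
    """Assumed regularizer: the weighted mean gate value.

    Minimizing it pushes gates toward 0, i.e. toward the linear-attention branch.
    """
    return weight * sum(sigmoid(z) for z in gate_logits) / len(gate_logits)


def discretize_gates(gate_logits, budget):
    """Keep full attention in the `budget` layers with the largest gate logits.

    Top-k selection is an assumed discretization rule; all other layers
    are assigned their linear-attention branch.
    """
    if not 0 <= budget <= len(gate_logits):
        raise ValueError("budget must be between 0 and the number of layers")
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    keep = set(ranked[:budget])
    return ["full" if i in keep else "linear" for i in range(len(gate_logits))]


# Hypothetical learned gate logits for a 6-layer model, with a budget of 2 full-attention layers.
learned_logits = [2.0, -1.5, 0.3, -0.7, 1.1, -2.0]
layout = discretize_gates(learned_logits, budget=2)
print(layout)
```

In this sketch only the gate logits would be trainable, which mirrors the frozen-weights setup described above; the budget fixes how many layers keep full attention regardless of how the gates are scaled.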

Compared to existing methods, FlashMorph discovers more effective hybrid configurations, which preserve strong long-context recall and general benchmark performance, while substantially reducing the cost of layer selection.