Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers
Researchers propose a framework for merging independently trained, billion-parameter pretrained transformers via linear mode connectivity, a scale at which existing merging methods run into limits. The approach combines function-preserving weight transformations with a dual learning procedure in which both models are jointly optimized toward a shared linear interpolation path.
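As a rough illustration of two ingredients named in the summary, the sketch below applies one well-known function-preserving transformation (a permutation of hidden units) to a tiny two-layer MLP and then linearly interpolates between two weight sets. It is a toy under stated assumptions, not the paper's method: the summary does not say which transformations the framework uses, the dual learning procedure is not shown, and every name here (`mlp`, `permute_hidden`, `interpolate`) is hypothetical.

```python
import numpy as np


def mlp(x, w1, b1, w2, b2):
    """Two-layer ReLU MLP: relu(x @ w1 + b1) @ w2 + b2."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


def permute_hidden(w1, b1, w2, perm):
    """Reorder hidden units (assumed example of a function-preserving transform).

    Columns of w1, entries of b1, and rows of w2 are reordered together,
    so the network computes the same function with different weights.
    """
    return w1[:, perm], b1[perm], w2[perm, :]


def interpolate(params_a, params_b, alpha):
    """Point on the linear path (1 - alpha) * A + alpha * B, tensor by tensor."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(params_a, params_b)]


rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
w1 = rng.normal(size=(d_in, d_hidden))
b1 = rng.normal(size=d_hidden)
w2 = rng.normal(size=(d_hidden, d_out))
b2 = rng.normal(size=d_out)
x = rng.normal(size=(5, d_in))

# Same function, different point in weight space.
perm = rng.permutation(d_hidden)
w1_p, b1_p, w2_p = permute_hidden(w1, b1, w2, perm)

y_original = mlp(x, w1, b1, w2, b2)
y_permuted = mlp(x, w1_p, b1_p, w2_p, b2)

# Midpoint of the straight line between the two weight sets.
params_a = [w1, b1, w2, b2]
params_b = [w1_p, b1_p, w2_p, b2]
midpoint = interpolate(params_a, params_b, 0.5)
y_midpoint = mlp(x, *midpoint)
```

Here `y_original` and `y_permuted` agree up to floating-point rounding, while `y_midpoint` generally does not match either. That gap is why merging methods choose such transformations deliberately, so that the two models' weights line up before interpolation, rather than interpolating arbitrary weight sets.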