Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers

Researchers propose a scalable framework for merging independently trained billion-parameter Transformers using linear mode connectivity, addressing the scalability limits of existing methods. The approach employs function-preserving weight transformations and a dual learning procedure in which both models jointly optimize toward a shared linear interpolation path.
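To make the idea of a function-preserving weight transformation concrete, the sketch below reorders the hidden units of a toy two-layer MLP and checks that the output does not change. It is an illustration only: the network, its sizes, and the helper names `mlp` and `permute_hidden` are hypothetical, and the transformations applied to Transformers in the work summarized here may be richer than a plain permutation.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Toy two-layer ReLU MLP: x -> relu(x @ w1 + b1) @ w2 + b2."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

def permute_hidden(w1, b1, w2, perm):
    """Reorder hidden units: columns of w1 and entries of b1 move
    together with the matching rows of w2."""
    return w1[:, perm], b1[perm], w2[perm, :]

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3  # illustrative sizes (assumption)
w1 = rng.normal(size=(d_in, d_hidden))
b1 = rng.normal(size=d_hidden)
w2 = rng.normal(size=(d_hidden, d_out))
b2 = rng.normal(size=d_out)
x = rng.normal(size=(5, d_in))

# A fixed, non-identity permutation: reverse the order of the hidden units.
perm = np.arange(d_hidden)[::-1]
w1_p, b1_p, w2_p = permute_hidden(w1, b1, w2, perm)

y_original = mlp(x, w1, b1, w2, b2)
y_permuted = mlp(x, w1_p, b1_p, w2_p, b2)

# Largest output difference; any nonzero value is floating-point rounding only.
max_diff = float(np.max(np.abs(y_original - y_permuted)))
```

The two parameter sets differ as arrays yet compute the same function, which is why independently trained models can sit far apart in weight space while being functionally close.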

The method applies properly parameterized function-preserving weight transformations to align functionally equivalent solutions.
A dual learning procedure allows both models to jointly learn corresponding transformations toward a shared linear interpolation path.
Bidirectional optimization substantially reduces interpolation barriers, enabling reliable merging across large-scale architectures.
Near-zero loss barriers are achieved on WikiText for medium-sized language models, marking the first demonstration of near-barrier-free linear connectivity at this scale.
In the vision domain, ViT-L maintains above 69% ImageNet top-1 accuracy throughout the interpolation path.
Modern billion-parameter LLMs exhibit only small loss barriers when parameter symmetries are properly resolved.
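
The role of symmetry resolution in the points above can be illustrated with a small loss-barrier computation. The sketch assumes one common definition of the barrier (the largest amount by which the loss along the linear path exceeds the linear interpolation of the endpoint losses); the summary does not state which definition the authors use. Model B is built as a permuted copy of model A, so the aligning permutation is known by construction, whereas the method described above learns its transformations. All names (`mlp`, `mse`, `interpolate`, `loss_barrier`) and the toy sizes are hypothetical.

```python
import numpy as np

def mlp(x, params):
    """Toy two-layer ReLU MLP with params = (w1, b1, w2, b2)."""
    w1, b1, w2, b2 = params
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

def mse(x, y, params):
    """Mean squared error of the MLP on (x, y)."""
    return float(np.mean((mlp(x, params) - y) ** 2))

def interpolate(params_a, params_b, alpha):
    """Point on the linear path: (1 - alpha) * A + alpha * B, per tensor."""
    return tuple((1.0 - alpha) * a + alpha * b
                 for a, b in zip(params_a, params_b))

def loss_barrier(x, y, params_a, params_b, num_points=11):
    """Largest excess of the path loss over the interpolated endpoint losses."""
    loss_a = mse(x, y, params_a)
    loss_b = mse(x, y, params_b)
    excess = []
    for alpha in np.linspace(0.0, 1.0, num_points):
        path_loss = mse(x, y, interpolate(params_a, params_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        excess.append(path_loss - baseline)
    return max(excess)

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3  # illustrative sizes (assumption)
params_a = (rng.normal(size=(d_in, d_hidden)), rng.normal(size=d_hidden),
            rng.normal(size=(d_hidden, d_out)), rng.normal(size=d_out))
x = rng.normal(size=(64, d_in))
y = mlp(x, params_a)  # targets come from model A, so A has zero loss

# Model B: same function as A, with the hidden units in reversed order.
perm = np.arange(d_hidden)[::-1]
w1, b1, w2, b2 = params_a
params_b = (w1[:, perm], b1[perm], w2[perm, :], b2)

# Resolve the symmetry by applying the inverse permutation to B.
inverse = np.argsort(perm)
w1_b, b1_b, w2_b, b2_b = params_b
params_b_aligned = (w1_b[:, inverse], b1_b[inverse], w2_b[inverse, :], b2_b)

naive_barrier = loss_barrier(x, y, params_a, params_b)
aligned_barrier = loss_barrier(x, y, params_a, params_b_aligned)
```

In this toy setting, interpolating the unaligned weights passes through networks that compute a different function, while the aligned path stays at the endpoint loss. Real models are not exact permuted copies of each other, so alignment there can only be expected to shrink the barrier rather than remove it exactly.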

Resolving parameter symmetries enables large pretrained Transformers to be connected and merged through simple linear paths with substantially improved interpolation performance.