Dual-Learned Matching Enables Linear Mode Connectivity for Billion-Parameter Transformers

Researchers propose a scalable framework for merging billion-parameter pretrained transformers via linear mode connectivity (LMC), the property that loss stays low along the straight line between two models' weights. Existing methods typically optimize the interpolation path from only one model endpoint, which limits how well they scale to large architectures. The new approach applies parameterized weight transformations to align functionally equivalent solutions, and it uses a dual learning procedure in which both models jointly learn transformations toward a shared path. This bidirectional optimization substantially reduces interpolation barriers and makes merging more reliable for large-scale models. Empirically, the method achieves near-zero loss barriers on WikiText for medium-sized language models. In vision tasks, ViT-L maintains above 69% ImageNet top-1 accuracy throughout the interpolation path. With this technique, modern billion-parameter LLMs exhibit only small loss barriers.
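To make the notion of a loss barrier concrete, the toy sketch below measures it along a linear interpolation path between two functionally equivalent networks, before and after alignment. It is an illustration under stated assumptions, not the paper's method: the barrier definition (the largest gap between the loss at an interpolated point and the linear interpolation of the endpoint losses) is one common convention that the summary does not spell out, the two-unit ReLU network and all function names are hypothetical, and a fixed hidden-unit permutation applied to one model stands in for the learned, two-sided transformations described above.

```python
# Toy illustration only: a fixed permutation replaces the learned,
# parameterized transformations described in the summary.

def relu(z):
    return z if z > 0.0 else 0.0

def forward(params, x):
    # One hidden layer, scalar input: f(x) = sum_j v_j * relu(w_j * x)
    w, v = params
    return sum(vj * relu(wj * x) for wj, vj in zip(w, v))

def mse(params, data):
    return sum((forward(params, x) - y) ** 2 for x, y in data) / len(data)

def interpolate(pa, pb, t):
    # Element-wise (1 - t) * theta_A + t * theta_B for every weight list
    return tuple(
        [(1 - t) * a + t * b for a, b in zip(la, lb)]
        for la, lb in zip(pa, pb)
    )

def permute_hidden(params, perm):
    # Reordering hidden units (and their outgoing weights) leaves f unchanged
    w, v = params
    return ([w[i] for i in perm], [v[i] for i in perm])

def loss_barrier(pa, pb, data, steps=11):
    # Assumed definition: max over t of
    # L(theta_t) - [(1 - t) * L(theta_A) + t * L(theta_B)]
    la, lb = mse(pa, data), mse(pb, data)
    worst = 0.0
    for k in range(steps):
        t = k / (steps - 1)
        gap = mse(interpolate(pa, pb, t), data) - ((1 - t) * la + t * lb)
        worst = max(worst, gap)
    return worst

# Two parameterizations of the same function; model_b has its units swapped.
model_a = ([1.0, -1.0], [2.0, 0.5])
model_b = ([-1.0, 1.0], [0.5, 2.0])

# Targets come from model_a, so both endpoints have zero loss.
data = [(x, forward(model_a, x)) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]

naive = loss_barrier(model_a, model_b, data)
aligned = loss_barrier(model_a, permute_hidden(model_b, [1, 0]), data)

print("barrier without alignment:", naive)
print("barrier with alignment:", aligned)
```

Both endpoints compute the same function, yet naive interpolation passes through a point where the hidden weights cancel, so the barrier is large. After the permutation the two weight vectors coincide and the barrier vanishes up to rounding. In real transformers the right transformation is not known in advance, which is why it has to be found by matching or learning.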