MELT and SALT: Multimodal Contrastive Learning for Earth Embeddings

MELT and SALT are multimodal contrastive learning models that use unpaired geospatial data to improve location embeddings. Both match the best two-modality baseline across four tasks, but adding more modalities does not consistently improve results, which suggests that the location encoder's design is the main performance bottleneck. Of the two, MELT trains more stably and is better suited to future model scaling.
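
This summary does not state the training objective, so the following is only a rough illustration of what contrastive alignment between a location encoder and one modality encoder can look like. It uses a generic symmetric InfoNCE loss, which is a common choice in contrastive learning and an assumption here, not MELT's or SALT's documented objective. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(loc_emb, mod_emb, temperature=0.1):
    """Symmetric InfoNCE loss for a batch of (location, modality) pairs.

    Row i of loc_emb and row i of mod_emb are treated as a positive pair;
    every other row in the batch serves as a negative. This is a generic
    formulation assumed for illustration, not the authors' method.
    """
    loc = [l2_normalize(v) for v in loc_emb]
    mod = [l2_normalize(v) for v in mod_emb]
    n = len(loc)

    # Cosine similarities between all location/modality pairs, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(loc[i], mod[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the location-to-modality and modality-to-location directions.
    cols = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))


# Toy check: aligned pairs should score a lower loss than mismatched pairs.
locations = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(locations, aligned)
loss_swapped = info_nce(locations, swapped)
```

In a multimodal setting, one such term could be computed per modality against a shared location encoder, so that modalities never need to be paired with each other directly. Whether MELT or SALT combine modalities this way is not stated in this summary.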