FoMoE partitions expert layers across workers so that no worker needs a full model replica, reducing communication costs by up to 1.42x relative to efficient baselines and by up to 45.44x relative to DDP. Its skip-token mechanism yields throughput speedups of up to 1.4x, its routing remains stable, and system modeling projects that these benefits extend to 100B-scale models.
FoMoE Breaks Full-Replica Barrier with Partitioned Expert Layers