SoftMoE replaces discrete top-k routing with a differentiable soft top-k relaxation based on LapSum, so that expert selection can be optimized directly by gradient descent. The model learns to allocate expert activations non-uniformly across layers, with later layers activating more experts, while using significantly fewer experts than traditional sparse MoE.
SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs
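
To make the idea of a soft top-k relaxation concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: it assumes the relaxed selection weight of each expert is a Laplace CDF applied to its router score minus a shared threshold, with the threshold chosen so that the weights sum to k. The function names, the temperature parameter `alpha`, and the use of bisection to find the threshold are all choices made for this example.

```python
import numpy as np


def laplace_cdf(z):
    """CDF of the standard Laplace distribution, evaluated elementwise."""
    z = np.asarray(z, dtype=float)
    tail = 0.5 * np.exp(-np.abs(z))  # never overflows, since the exponent is <= 0
    return np.where(z < 0, tail, 1.0 - tail)


def soft_top_k(scores, k, alpha=0.1, iters=60):
    """Relaxed top-k: weights in (0, 1) that sum to approximately k.

    Assumption for this sketch: the weight of item i is
    laplace_cdf((scores[i] - b) / alpha), where the shared threshold b
    is found by bisection so that the weights sum to k.
    """
    scores = np.asarray(scores, dtype=float)
    if not 0 < k < scores.size:
        raise ValueError("k must satisfy 0 < k < number of scores")

    # The weight sum decreases monotonically in b, so bracket the root widely.
    lo = scores.min() - 50.0 * alpha
    hi = scores.max() + 50.0 * alpha
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if laplace_cdf((scores - mid) / alpha).sum() > k:
            lo = mid  # too much total weight: raise the threshold
        else:
            hi = mid  # too little total weight: lower the threshold
    b = 0.5 * (lo + hi)
    return laplace_cdf((scores - b) / alpha)


# Hypothetical router scores for four experts, softly selecting two of them.
router_scores = np.array([2.0, 0.5, 1.5, -1.0])
weights = soft_top_k(router_scores, k=2, alpha=0.1)
print(weights, weights.sum())
```

With a small `alpha` the weights approach a hard 0/1 top-k mask, and with a larger `alpha` they spread more evenly across experts. Because the weights vary smoothly with the scores, gradients can reach the router, which is the property that discrete top-k routing lacks.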